Finally, some success after a year of research into teaching a computer to understand a question the way a human does.

As I have always believed, there is a way to represent each and every natural language sentence using mathematical notation. It's been a year since I started playing around with various AI algorithms, and after countless failed attempts, today I made some progress.

A sentence can be divided into three parts, as described by the Subject-Verb-Object (SVO) model. So a good AI that can understand a user must be able to parse a sentence or a question into these three categories.

This is what I discovered:

  1. Identify the types of words that do not have any ambiguous meaning, e.g. in, for, but, what, why, the, an, etc.
  2. Replace the remaining words or nouns in the sentence with 'α', 'β', 'γ', 'δ', 'ε', 'ζ', 'η', 'θ' and so on, in the order they appear in the sentence.
  3. Remove all determiners like The, An, A.
  4. Replace some of the identified keywords with mathematical notation, e.g. "of" becomes ∈ (I am yet to identify notations for the rest of the keywords).
  5. Replace all verbs (action words) with Δ.
  6. When you run this algorithm, all sample sentences show a similar pattern, and just by replacing the variables you can identify the subject, verb and object.
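
The steps above can be sketched roughly as follows. This is only a minimal illustration, not my actual program: the word lists (`KEYWORDS`, `DETERMINERS`, `VERBS`) are tiny hand-picked assumptions, and the function name `symbolize` is made up for this example. A real version would need a proper part-of-speech tagger to find the verbs.

```python
import re

# Hypothetical, hand-picked word lists for illustration only.
KEYWORDS = {"what", "which", "who", "why", "in", "for", "but", "of"}
DETERMINERS = {"the", "a", "an"}
VERBS = {"inhabit", "portrayed", "owned", "caused", "seized", "won"}
NOTATION = {"of": "∈"}          # step 4: keywords that already have a symbol
GREEK = "αβγδεζηθ"              # step 2: variables, in order of appearance


def symbolize(sentence):
    """Turn a sentence into a symbolic pattern plus variable bindings."""
    tokens = re.findall(r"[\w'-]+", sentence)  # also drops punctuation like '?'
    symbols, bindings, phrase = [], {}, []

    def flush():
        # Bind the words collected so far to the next Greek variable.
        if phrase:
            if len(bindings) >= len(GREEK):
                raise ValueError("ran out of variables")
            var = GREEK[len(bindings)]
            bindings[var] = " ".join(phrase)
            symbols.append(var)
            phrase.clear()

    for tok in tokens:
        low = tok.lower()
        if low in DETERMINERS:          # step 3: drop determiners
            continue
        if low in VERBS:                # step 5: verbs become Δ
            flush()
            symbols.append("Δ")
        elif low in KEYWORDS:           # steps 1 and 4: keep or replace keywords
            flush()
            symbols.append(NOTATION.get(low, tok))
        else:                           # step 2: everything else joins a variable
            phrase.append(tok)
    flush()

    qtype = tokens[0].lower() if tokens else ""
    return {"symbol": " ".join(symbols), "sentence": sentence,
            "qtype": qtype, **bindings}


result = symbolize("What Polynesian people inhabit New Zealand ?")
print(result["symbol"], "|", result["α"], "|", result["β"])
```

Consecutive unknown words are grouped into a single variable here, which is why "Polynesian people" ends up as one α rather than two variables.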

Examples:

Question: What Polynesian people inhabit New Zealand ?

After running through the code, the above question becomes:

{
        "symbol": "What α  Δ  β",
        "sentence": "What Polynesian people inhabit New Zealand ?",
        "processed": " What α  inhabit β ",
        "qtype": "what",
        "α": "Polynesian people",
        "β": "New Zealand"
}

As you can see in the output above, our question "What Polynesian people inhabit New Zealand" can be represented symbolically as What α Δ β.

Equations of type α Δ β imply α (subject), Δ (verb), β (object). So a smart bot needs to look up New Zealand and search for the words “Polynesian people” and “inhabit”.
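
To make the α Δ β reading concrete, here is a small sketch of how an output record like the one above could be turned back into subject, verb and object. The helper name `to_svo` is hypothetical, and it assumes the "symbol" and "processed" fields line up token by token, which holds for the samples shown in this post but is not guaranteed in general.

```python
def to_svo(record):
    """Read subject, verb and object off a record whose pattern contains α Δ β."""
    symbols = record["symbol"].split()
    words = record["processed"].split()
    # Assumption: both fields have the same number of tokens, so the word at
    # the position of Δ in "processed" is the original verb.
    if len(symbols) != len(words) or "Δ" not in symbols:
        return None
    i = symbols.index("Δ")
    if i == 0 or i + 1 >= len(symbols):
        return None  # no variable on one side of the verb
    return {
        "subject": record.get(symbols[i - 1]),
        "verb": words[i],
        "object": record.get(symbols[i + 1]),
    }


record = {
    "symbol": "What α  Δ  β",
    "sentence": "What Polynesian people inhabit New Zealand ?",
    "processed": " What α  inhabit β ",
    "qtype": "what",
    "α": "Polynesian people",
    "β": "New Zealand",
}
print(to_svo(record))
```

The variable just before Δ is taken as the subject and the one just after it as the object; anything further along the pattern is ignored by this sketch.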

A few more sample outputs:

[
  {
    "symbol": "What α  Δ  β",
    "sentence": "What actor first portrayed James Bond ?",
    "processed": " What α  portrayed β ",
    "qtype": "what",
    "α": "actor first",
    "β": "James Bond ?"
  },
  {
    "symbol": "What α  Δ  β",
    "sentence": "What Soviet leader owned a Rolls-Royce ?",
    "processed": " What α  owned β ",
    "qtype": "what",
    "α": "Soviet leader",
    "β": "Rolls-Royce ?"
  },
  {
    "symbol": "What α  Δ  β",
    "sentence": "What crop failure caused the Irish Famine ?",
    "processed": " What α  caused β ",
    "qtype": "what",
    "α": "crop failure",
    "β": "Irish Famine ?"
  },
  {
    "symbol": "What α  Δ  β",
    "sentence": "What country 's people are the top television watchers ?",
    "processed": " What α  are β ",
    "qtype": "what",
    "α": "country ' s people",
    "β": "top television watchers ?"
  },
  {
    "symbol": "Which α  Δ  β",
    "sentence": "Which NBA players had jersey number 0 ?",
    "processed": " Which α  had β ",
    "qtype": "which",
    "α": "NBA players",
    "β": "jersey number 0 ?"
  },
  {
    "symbol": "Which α  Δ  β",
    "sentence": "Which country did Hitler rule ?",
    "processed": " Which α  did β ",
    "qtype": "which",
    "α": "country",
    "β": "Hitler rule ?"
  },
  {
    "symbol": "Which α  Δ  β",
    "sentence": "Which language has the most words ?",
    "processed": " Which α  has β ",
    "qtype": "which",
    "α": "language",
    "β": "most words ?"
  }
]

 

Some sample outputs are of the type α Δ β in γ:

[
  {
    "symbol": "Which α  Δ  β in γ",
    "sentence": "Which Ventura County police department seized the largest cocaine shipment in it 's history ?",
    "processed": " Which α  seized β  in γ ",
    "qtype": "which",
    "α": "Ventura County police department",
    "β": "largest cocaine shipment",
    "γ": "it ' s history ?"
  },
  {
    "symbol": "Which α  Δ  β in γ",
    "sentence": "Which cats pursued Tweety Pie in his first cartoon appearance ?",
    "processed": " Which α  pursued β  in γ ",
    "qtype": "which",
    "α": "cats",
    "β": "Tweety Pie",
    "γ": "his first cartoon appearance ?"
  },
  {
    "symbol": "Which α  Δ  β in γ",
    "sentence": "Which team won the Super Bowl in 1968 ?",
    "processed": " Which α  won β  in γ ",
    "qtype": "which",
    "α": "team",
    "β": "Super Bowl",
    "γ": "1968 ?"
  }
]

The formula still works, but it has an extra parameter γ that adds more context to our object.

After running my program on more than 5000 questions, I have around 100 different equations. While most of them fall into the above 3 types, there are a few much longer ones, like α Δ β as γ with δ and α ∈ β in which γ is Δ without δ, which I am yet to map to SVO.
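
Counting how often each equation shows up can be sketched like this. The function name `pattern_counts` is made up for the example, and it assumes each output record carries a "symbol" field like the samples above; the extra whitespace inside the symbol is collapsed so that identical patterns are counted together.

```python
from collections import Counter


def pattern_counts(records):
    """Tally how many records share each symbolic pattern."""
    # Collapse runs of spaces so "What α  Δ  β" and "What α Δ β" match.
    return Counter(" ".join(r["symbol"].split()) for r in records)


samples = [
    {"symbol": "What α  Δ  β"},
    {"symbol": "Which α  Δ  β"},
    {"symbol": "Which α  Δ  β in γ"},
    {"symbol": "What α  Δ  β"},
]
for pattern, count in pattern_counts(samples).most_common():
    print(count, pattern)
```

Sorting by frequency like this makes it easy to see which few patterns cover most questions and which long ones are rare outliers.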