Python NLTK interpret a fixed pattern of sentence and tokenize it

Question

I have an application where NLTK needs to interpret speech delivered by humans and find meaningful chunks in it. The sentence to be interpreted is of the form from <source>, to <destination> on <date>, <class>. As you can imagine, this can be expressed in myriad ways, for example:

I want to go to New York from Atlanta, business class, 25th July 2014.
I want to travel via business class, to Atlanta on 25th July from New York.
I have a dream that I will one day board a plane, travel in business class, descend at New York, the source being at Atlanta, preferably on 25th July.
25th July Atlanta to New York, business class.

You get the idea. What I want to extract are a few tidbits of information: source, destination, class, date. Some may be missing, and that has to be detected or filled in with a sensible default. For example, if the source is missing, report that; if the year is missing, assume the current year. All the while, the irrelevant parts should be ignored (like the "I have a dream" part, much as I adore Martin Luther King).

Is there any way I can achieve this in NLTK? I am aware that there are taggers available, and there are ways to train taggers, but I don't have sufficient knowledge on that. Is it possible to cover more or less all possible cases that can mean such a sentence, and extract the information like this? If so, a little guidance would be appreciated.

user4179775 · Accepted Answer

In computational linguistics this is known as “Named Entity Recognition”: the process of identifying things like organisations, people and locations in text.

The challenge here is that the default NE chunker in nltk is a maximum entropy chunker trained on the ACE corpus. It has not been trained to recognise dates and times, so you need to tweak it and find a way to detect time.

Several packages can help extract named entities. Stanford NER (Named Entity Recognizer) is one of the most popular Named Entity Recognition tools and is implemented in Java. You can still use it from Python by downloading the package and calling it through the interface to Stanford NER that NLTK provides.

You can download Stanford Named Entity Recognizer version 3.4, which contains stanford-ner.jar and the classifier model “all.3class.distsim.crf.ser.gz”.

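# In newer NLTK releases this class is named StanfordNERTagger.
from nltk.tag.stanford import NERTagger

def stanfordNERExtractor(sentence):
    # Point the tagger at the downloaded classifier model and the jar.
    st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
                   '/usr/share/stanford-ner/stanford-ner.jar')
    # The tagger takes a list of tokens and tags each token separately.
    return st.tag(sentence.split())

stanfordNERExtractedLines = stanfordNERExtractor("New York")
print(stanfordNERExtractedLines)  # expected: (token, tag) pairs such as ('New', 'LOCATION'), ('York', 'LOCATION')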
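
Because the tagger labels each token separately, a multi-word place such as "New York" comes back as two LOCATION tokens. A minimal sketch of stitching adjacent LOCATION tokens back together is shown below; merge_locations and sample_tags are hypothetical names, and the sample list is written by hand in the (word, tag) shape rather than produced by a real tagger run.

```python
from itertools import groupby

def merge_locations(tagged_tokens):
    """Join runs of adjacent LOCATION tokens into single place names."""
    places = []
    # groupby yields one group per run of consecutive tokens sharing a tag.
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label == 'LOCATION':
            places.append(' '.join(word for word, _ in run))
    return places

# Hand-written sample in the (word, tag) shape; not real tagger output.
sample_tags = [('to', 'O'), ('New', 'LOCATION'), ('York', 'LOCATION'),
               ('from', 'O'), ('Atlanta', 'LOCATION')]
print(merge_locations(sample_tags))  # ['New York', 'Atlanta']
```

One limitation of this approach: two different places that appear back to back with nothing between them would be merged into one name.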

You can also use NLTK's own chunker; the official documentation has more details. Check this gist from Gavin:

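import nltk

def extract_entities(text):
    # Print the label and the words of every named-entity chunk in the text.
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            # Entity chunks are subtrees; ordinary tokens are (word, tag) tuples.
            if hasattr(chunk, 'label'):
                print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))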

extract_entities("to play to Atlanta")

#Output: [('to', 'TO'), ('play', 'VB'), ('to', 'TO'), ('play', 'NN')]

How can we identify the destination? Even after locations have been recognised, you may have trouble recognising place names that contain a space, or distinguishing the source from the destination.

A regular expression is a simple way to identify the source and the destination. It may also pick up unrelated phrases such as "to get", but you can check each candidate against the locations the Stanford tagger returned from st.tag (tagged "LOCATION") or, if you used NLTK, check whether the word was tagged as a verb or noun ("VB"/"NN"). You can also experiment with NLTK’s UnigramTagger() and BigramTagger() to pick out names after “FROM” and “TO” that could be locations.

import re

text = "I want to go to New York from Atlanta, business class, on 25th July."
# A place is one or more capitalised words joined by a space or a hyphen.
place = r'([A-Z][a-zA-Z]+(?:[\s-][A-Z][a-zA-Z]+)*)'
destination = re.findall(r'\bto\s+' + place, text)
source = re.findall(r'\bfrom\s+' + place, text)

print(source, destination)  # ['Atlanta'] ['New York']
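
The question also asks for a missing source or destination to be detected. A rough sketch of that, assuming the same "from X" / "to Y" cue words as above, could look like this. The names PLACE and extract_route are made up for illustration, and the pattern only accepts capitalised words joined by a space or hyphen.

```python
import re

# One or more capitalised words joined by a space or hyphen, e.g. "New York".
PLACE = r"([A-Z][a-zA-Z]+(?:[\s-][A-Z][a-zA-Z]+)*)"

def extract_route(sentence):
    """Return source and destination, with None for whichever is not stated."""
    src = re.search(r"\bfrom\s+" + PLACE, sentence)
    dst = re.search(r"\bto\s+" + PLACE, sentence)
    route = {
        "source": src.group(1) if src else None,
        "destination": dst.group(1) if dst else None,
    }
    # List the fields the caller still has to ask for or assume.
    route["missing"] = [k for k in ("source", "destination") if route[k] is None]
    return route

print(extract_route("I want to travel via business class, to Atlanta on 25th July from New York."))
# {'source': 'New York', 'destination': 'Atlanta', 'missing': []}
print(extract_route("Book me a business class seat to New York."))
# {'source': None, 'destination': 'New York', 'missing': ['source']}
```

Phrasings without the cue words, such as "Atlanta to New York" or "the source being at Atlanta", are not covered by this sketch and would need extra patterns or the tagger output.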

How can we identify the time/date?

As mentioned above, the default chunker does not recognise dates, so this is one of the problems we face. We can use a regular expression instead, as mentioned in this thread.

print(re.findall(
    r"""(?ix)             # case-insensitive, verbose regex
    \b                    # match a word boundary
    (?:                   # match the following three times:
     (?:                  # either
      \d+                 # a number,
      (?:\.|st|nd|rd|th)* # followed by a dot, st, nd, rd, or th (optional)
      |                   # or a month name
      (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
     )
     [\s./-]*             # followed by a date separator or whitespace (optional)
    ){3}                  # do this three times
    \b """,
    text))

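Output (the text above has no year, so the three repetitions are filled by 2, 5th and July):

['25th July']

For a sentence that ends with 25th July 2014. the match is 25th July 2014.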

We can also use python-dateutil or this instead of a regular expression.

If part of the date is missing, such as the year or the month, we can handle that with the parsedatetime package.

Check this quick example (you can adapt it to different scenarios):

>>> import parsedatetime
>>> p = parsedatetime.Calendar()
>>> print(p.parse("25th this month"))
(time.struct_time(tm_year=2014, tm_mon=11, tm_mday=10, tm_hour=1, tm_min=5, tm_sec=31, tm_wday=0, tm_yday=314, tm_isdst=0), 0)
>>> print(p.parse("25th July"))
((2015, 7, 25, 1, 5, 50, 0, 314, 0), 1)
>>> print(p.parse("25th July 2014"))
((2014, 7, 25, 1, 6, 3, 0, 314, 0), 1)
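
If the rule from the question is wanted literally (a missing year means the current year), it can also be sketched with the standard library alone. This is only an illustration: DATE_RE and parse_travel_date are hypothetical names, the pattern only understands "day month [year]" with English month names, and strptime's month parsing depends on the active locale.

```python
import re
from datetime import date, datetime

# Day with optional ordinal suffix, a month word, and an optional 4-digit year.
DATE_RE = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]+)(?:\s+(\d{4}))?\b")

def parse_travel_date(text, today=None):
    """Parse 'day month [year]'; a missing year defaults to today's year."""
    today = today or date.today()
    for day, month, year in DATE_RE.findall(text):
        for fmt in ("%d %B %Y", "%d %b %Y"):  # full or abbreviated month name
            try:
                return datetime.strptime(
                    f"{day} {month} {year or today.year}", fmt).date()
            except ValueError:
                continue  # not a real month or day, try the next candidate
    return None

# "today" is pinned here only to make the example reproducible.
print(parse_travel_date("on 25th July.", today=date(2014, 11, 10)))  # 2014-07-25
print(parse_travel_date("business class, 25th July 2014."))          # 2014-07-25
```

Note the difference from the parsedatetime session above, which resolved "25th July" to 2015, the next occurrence after the day it was run; this sketch always uses the current year instead.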

Finally, you can use this dataset to extract airports and verify the locations mentioned, in case you are answering with availabilities (some locations have no airport).
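
As a rough illustration of that check, assuming the airport data has already been reduced to a set of city names: the set below is a tiny hand-written placeholder and not taken from any dataset, and has_airport is a hypothetical helper.

```python
# Placeholder sample; a real set would be loaded from an airport dataset.
CITIES_WITH_AIRPORTS = {"atlanta", "new york", "boston"}

def has_airport(city):
    """Case-insensitive membership test against the known-airport set."""
    return city.strip().lower() in CITIES_WITH_AIRPORTS

for candidate in ("New York", "Atlanta", "Smallville"):
    print(candidate, has_airport(candidate))
```

Exact matching like this will miss spelling variants, so a real lookup would probably need some normalisation or fuzzy matching.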

For the class, you can look for phrases such as "economy class" or "business class" in the sentence (either with the in operator or with a regular expression).

For more details on this topic, check: NLTK - Extracting Information from Text
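
A small sketch of the regular-expression variant, which returns None when no class is mentioned so that the caller can treat it as missing. CLASS_RE and extract_class are hypothetical names, and "first" is an assumed extra option that the question does not mention.

```python
import re

# Accept "economy" or "economic", plus business and (assumed) first class.
CLASS_RE = re.compile(r"\b(econom(?:y|ic)|business|first)\s+class\b",
                      re.IGNORECASE)

def extract_class(sentence):
    """Return the normalised travel class, or None if none is mentioned."""
    match = CLASS_RE.search(sentence)
    if not match:
        return None
    found = match.group(1).lower()
    return "economy" if found.startswith("econom") else found

print(extract_class("travel in business class, descend at New York"))  # business
print(extract_class("25th July Atlanta to New York"))                  # None
```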
