Reputation: 5153
I have an application where NLTK needs to interpret speech delivered by humans and find meaningful chunks in it. The sentence to be interpreted has the form: from <somewhere>, to <somewhere> on <some_date>, <class_of_travel, like AC_CHAIR_CAR>. As you can imagine, this can be expressed in a myriad of ways, for example:
I want to go to New York from Atlanta, business class, 25th July 2014.
I want to travel via business class, to Atlanta on 25th July from New York.
I have a dream that I will one day board a plane, travel in business class, descend at New York, the source being at Atlanta, preferably on 25th July.
25th July Atlanta to New York, business class.
You get the idea. What I want to extract are a few tidbits of information: source, destination, class, date. Some may be missing, and those have to be identified or appropriately assumed. For example, if the source is missing, report that; if the year is missing, chalk it up to the current year. And all the while, ignore the useless information (like the I have a dream part, much as I adore Martin Luther King).
Is there any way I can achieve this in NLTK? I am aware that there are taggers available, and there are ways to train taggers, but I don't have sufficient knowledge on that. Is it possible to cover more or less all possible cases that can mean such a sentence, and extract the information like this? If so, a little guidance would be appreciated.
Upvotes: 1
Views: 2124
Reputation:
In computational linguistics, this is known as “Named Entity Recognition”: the process of identifying things like organisations, people and locations in text.
The challenge here is that the default NE chunker in NLTK is a maximum entropy chunker trained on the ACE corpus. It has not been trained to recognise dates and times, so you need to tweak it and find a way to detect time.
Several packages can extract named entities. Stanford NER (Named Entity Recognizer) is one of the most popular Named Entity Recognition tools. It is implemented in Java, but you can use it by downloading the package and calling it through the Stanford NER interface that NLTK provides.
You can download Stanford Named Entity Recognizer version 3.4, which contains stanford-ner.jar and the classifier model “all.3class.distsim.crf.ser.gz”.
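from nltk.tag.stanford import StanfordNERTagger  # named NERTagger in older NLTK releases

def stanfordNERExtractor(sentence):
    # Paths assume Stanford NER is unpacked under /usr/share/stanford-ner
    st = StanfordNERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
                           '/usr/share/stanford-ner/stanford-ner.jar')
    # The tagger expects a list of tokens, so split the sentence on whitespace
    return st.tag(sentence.split())

stanfordNERExtractedLines = stanfordNERExtractor("New York")
# One (token, tag) pair per token, e.g. [('New', 'LOCATION'), ('York', 'LOCATION')]
print(stanfordNERExtractedLines)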
You can also use NLTK on its own. You will find more details in the official documentation; check this gist from Gavin:
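import nltk

def extract_entities(text):
    for sent in nltk.sent_tokenize(text):
        # ne_chunk returns a tree: named entities are subtrees, other tokens are (word, tag) tuples
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            if hasattr(chunk, 'label'):  # NLTK 3 API; NLTK 2 used chunk.node
                print(chunk.label(), ' '.join(c[0] for c in chunk.leaves()))

extract_entities("to play to Atlanta")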
#Output: [('to', 'TO'),('play', 'VB'),('to', 'TO'),('play', 'NN')],
It's preferable to write a regular expression pattern to identify the source and the destination. The pattern may also pick up other words that follow "to", as in "to get", but you can check each candidate against the list of locations identified by st.tag ("LOCATION") or, if you used NLTK, check whether the word was tagged as a verb or a noun ("VB"/"NN"). You can also explore NLTK’s UnigramTagger() and BigramTagger() to get the names after “FROM” and “TO” that could be identified as locations.
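import re

text = "I want to go to New York from Atlanta, business class, on 25th July."

# One or more capitalised words (joined by a space or hyphen) right after "to" / "from".
# The earlier lazy pattern stopped after the first word and returned "New" instead of "New York".
place = r'([A-Z][a-zA-Z]+(?:[\s-][A-Z][a-zA-Z]+)*)'
destination = re.findall(r'\bto\s+' + place, text)
source = re.findall(r'\bfrom\s+' + place, text)
print(source, destination)  # ['Atlanta'] ['New York']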
As mentioned above, detecting the date is one of the problems we face, but a regular expression can handle it, as mentioned in this thread.
print(re.findall(
    r"""(?ix)              # case-insensitive, verbose regex
    \b                     # match a word boundary
    (?:                    # match the following three times:
      (?:                  # either
        \d+                # a number,
        (?:\.|st|nd|rd|th)*  # followed by a dot, st, nd, rd, or th (optional)
      |                    # or a month name
        (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
      )
      [\s./-]*             # followed by a date separator or whitespace (optional)
    ){3}                   # do this three times
    \b""",
    text))
Output:
25th July 2014.
We can also use python-dateutil or this instead of using regular expression.
If a part of the date is missing, such as the year or the month, we can handle that with the parsedatetime package.
Check this quick example (You can adapt it based on different scenarios)
>>> import parsedatetime
>>> p = parsedatetime.Calendar()
>>> print p.parse("25th this month")
(time.struct_time(tm_year=2014, tm_mon=11, tm_mday=10, tm_hour=1, tm_min=5, tm_sec=31, tm_wday=0, tm_yday=314, tm_isdst=0), 0)
>>> print p.parse("25th July")
((2015, 7, 25, 1, 5, 50, 0, 314, 0), 1)
>>> print p.parse("25th July 2014")
((2014, 7, 25, 1, 6, 3, 0, 314, 0), 1)
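If you would rather not depend on an extra package, the "missing year" rule from the question can be sketched with the standard library alone. This is only an illustration under some assumptions: the names parse_travel_date, DATE_RE and MONTH_NAMES are made up here, dates are expected in day-month order ("25th July", "25 Jul 2014"), and a missing year falls back to the current year (as the question asks) rather than to the next occurrence of that date.

```python
import re
from datetime import date

# Full month names; a word counts as a month if some name starts with it ("Jul", "Sept", "July").
MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]

# Day (with optional ordinal suffix), a word of 3+ letters, and an optional 4-digit year.
DATE_RE = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]{3,})(?:\s+(\d{4}))?")

def month_number(word):
    """Return 1-12 if `word` is a month name or a prefix of one, else None."""
    word = word.lower()
    for number, name in enumerate(MONTH_NAMES, start=1):
        if name.startswith(word):
            return number
    return None

def parse_travel_date(text, default_year=None):
    """Find the first day-month(-year) date in `text`; fill a missing year with `default_year`."""
    if default_year is None:
        default_year = date.today().year  # assumption: a missing year means the current year
    for match in DATE_RE.finditer(text):
        day, month_word, year = match.groups()
        month = month_number(month_word)
        if month is None:
            continue  # e.g. "25th this month": "this" is not a month name
        try:
            return date(int(year) if year else default_year, month, int(day))
        except ValueError:
            continue  # impossible dates such as 31st February
    return None

print(parse_travel_date("I want to go to New York from Atlanta, business class, 25th July 2014."))
print(parse_travel_date("preferably on 25th July."))
```

A return value of None tells the caller that no date was found, so the application can ask the user for it instead of guessing.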
Finally, you can use this dataset to extract airports and verify the locations mentioned, in case you are answering with availabilities (some locations have no airport).
For the class, you can check whether phrases such as "economic class" or "business class" occur in the sentence (using either the in operator or a regular expression).
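As a rough sketch of the regular expression variant, something like the following could work. The class vocabulary (economy/economic, business, first) and the names CLASS_PATTERN and find_travel_class are assumptions made for illustration; adapt the word list to whatever classes your system actually sells.

```python
import re

# Hypothetical class vocabulary; extend it to match the real booking system.
CLASS_PATTERN = re.compile(r"\b(economy|economic|business|first)\s+class\b", re.IGNORECASE)

def find_travel_class(sentence):
    """Return a normalised class name found in `sentence`, or None if no class is mentioned."""
    match = CLASS_PATTERN.search(sentence)
    if match is None:
        return None  # missing: the caller can ask the user or apply a default
    word = match.group(1).lower()
    # Treat "economic class" as a spoken variant of "economy class".
    return "economy" if word == "economic" else word

print(find_travel_class("I want to travel via business class, to Atlanta on 25th July from New York."))
```

For a fixed phrase, the plain check `"business class" in sentence.lower()` does the same job; the regex mainly helps once there are several classes and spelling variants.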
For more details on this topic, check: NLTK - Extracting Information from Text
Upvotes: 2
Reputation: 9256
This problem is called 'Named Entity Recognition' (or just 'NER'). Googling those phrases should point you towards many libraries, online APIs, clever rules of thumb for specific types of data, etc.
Check out a demo NER system at http://nlp.stanford.edu:8080/ner/
Detecting references to dates and times is probably the case which has the most heuristic-based solutions out there.
If you have a specific and pretty limited domain of text you are working with, then setting up manually curated lists of entities might prove to be very helpful.
For example, just make a list of all airport codes and the names of all cities that have a commercial airport, and try exact string matching of those names against any input text.
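A minimal sketch of that list-matching idea, assuming a tiny hand-written list (KNOWN_CITIES is a placeholder, not a real airport dataset) and hypothetical helper names:

```python
import re

# Placeholder gazetteer; in practice, load every city/airport name from a curated list.
KNOWN_CITIES = ["Atlanta", "New York", "York", "Boston"]

def build_city_pattern(cities):
    """Compile one alternation of all names, longest first so "New York" beats "York"."""
    ordered = sorted(cities, key=len, reverse=True)
    alternatives = "|".join(re.escape(city) for city in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def find_cities(text, cities=KNOWN_CITIES):
    """Return the known cities mentioned in `text`, in order of appearance."""
    pattern = build_city_pattern(cities)
    canonical = {city.lower(): city for city in cities}  # map any casing back to the listed spelling
    return [canonical[match.group(0).lower()] for match in pattern.finditer(text)]

print(find_cities("I want to go to New York from Atlanta, business class, 25th July 2014."))
# ['New York', 'Atlanta']
```

This only tells you which known places are mentioned. Deciding which one is the source and which is the destination still needs the surrounding words ("from", "to").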
Upvotes: 2