Reputation: 22051
I want to extract cardinal (CD) values that are associated with units of measurement and store them in a dictionary. For example, if the text contains tokens like "20 kgs", they should be extracted and kept in a dictionary.
Example:
For the input text "10-inch fry pan offers superb heat conductivity and distribution", the output dictionary should look like {"dimension": "10-inch"}.
For the input text "This bucket holds 5 litres of water.", the output should look like {"volume": "5 litres"}.
import nltk
line = 'This bucket holds 5 litres of water.'
tokenized = nltk.word_tokenize(line)
tagged = nltk.pos_tag(tokenized)
The code above gives the following value for tagged:
[('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'), ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
Is there a way to extract the CD and UOM values from the text?
Upvotes: 4
Views: 1992
Reputation: 4618
Hm, not sure if it helps, but I wrote something for this in JavaScript. Here: http://github.com/redaktor/nlp_compromise
It is not well documented yet, and it is currently being ported to a 2.0 branch.
It should be easy to port to Python; the question "What's different between Python and Javascript regular expressions?" covers the regex differences to watch for.
Also, did you check Python's NLTK? http://www.nltk.org
Upvotes: 1
Reputation: 2296
Not sure how flexible you need the process to be. You can play around with nltk.RegexpParser and come up with some good patterns:
import nltk
sentence = 'This bucket holds 5 litres of water.'
parser = nltk.RegexpParser(
"""
INDICATOR: {<CD><NNS>}
""")
print(parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence))))
Output:
(S
This/DT
bucket/NN
holds/VBZ
(INDICATOR 5/CD litres/NNS)
of/IN
water/NN
./.)
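To get from a CD-plus-noun match to the dictionary the question asks for, something along these lines might work. It is only a rough sketch in plain Python that scans the tagged tokens directly rather than walking the parse tree. The UNIT_CATEGORIES table, the HYPHENATED pattern and the extract_measurements function are all made up for illustration, and the unit list is an assumption you would need to extend. Tokens such as "10-inch" stay in one piece after tokenizing, so the sketch matches them by their text instead of relying on their tag.

```python
import re

# Hypothetical unit-to-category table (an assumption); extend it for your data.
UNIT_CATEGORIES = {
    "inch": "dimension", "inches": "dimension", "cm": "dimension",
    "litre": "volume", "litres": "volume", "ml": "volume",
    "kg": "weight", "kgs": "weight",
}

# Matches single tokens such as "10-inch" that the tokenizer keeps together.
HYPHENATED = re.compile(r"^(\d+(?:\.\d+)?)-([A-Za-z]+)$")

def extract_measurements(tagged):
    """Map each recognised unit category to the matching measurement text."""
    found = {}
    for i, (word, tag) in enumerate(tagged):
        match = HYPHENATED.match(word)
        if match and match.group(2).lower() in UNIT_CATEGORIES:
            # Number and unit in one token: keep the token as written.
            found[UNIT_CATEGORIES[match.group(2).lower()]] = word
        elif tag == "CD" and i + 1 < len(tagged):
            # Cardinal number: check whether the next token is a known unit.
            unit = tagged[i + 1][0]
            if unit.lower() in UNIT_CATEGORIES:
                found[UNIT_CATEGORIES[unit.lower()]] = f"{word} {unit}"
    return found

# Tagged tokens copied from the question, so no tagger is needed to try this.
tagged = [('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'),
          ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
print(extract_measurements(tagged))  # {'volume': '5 litres'}
```

Note that a later match in the same category overwrites an earlier one, so if a sentence can contain several measurements of one kind you would want to collect lists instead of single strings.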
You can also create a corpus and train a chunker.
Upvotes: 2