Extract associated values from text using NLP

Question

I want to extract cardinal (CD) values associated with units of measurement and store them in a dictionary. For example, if the text contains tokens like "20 kgs", they should be extracted and stored in a dictionary.

Example:

For the input text "10-inch fry pan offers superb heat conductivity and distribution", the output dictionary should look like {"dimension": "10-inch"}.

For the input text "This bucket holds 5 litres of water.", the output should look like {"volume": "5 litres"}.

import nltk

line = 'This bucket holds 5 litres of water.'
tokenized = nltk.word_tokenize(line)
tagged = nltk.pos_tag(tokenized)

Printing tagged after running the code above gives:

[('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'), ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]

Is there a way to extract the CD and UOM values from the text?

bogs · Accepted Answer

I'm not sure how flexible you need the process to be. You can play around with nltk.RegexpParser and come up with some good patterns:

import nltk

sentence = 'This bucket holds 5 litres of water.'

parser = nltk.RegexpParser(
    """
    INDICATOR: {<CD><NNS>}
    """)

print(parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence))))

Output:

(S
  This/DT
  bucket/NN
  holds/VBZ
  (INDICATOR 5/CD litres/NNS)
  of/IN
  water/NN
  ./.)
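
To get from tagged tokens to the dictionary the question asks for, you still need some mapping from unit words to category names. Here is a minimal sketch of that step in plain Python. It works directly on the (word, tag) list rather than on the parse tree, and the names UNIT_CATEGORIES, HYPHENATED and extract_measurements are all hypothetical, as is the small unit table, so treat it as a starting point rather than a complete solution.

```python
import re

# Hypothetical unit-to-category table (an assumption); extend it for your data.
UNIT_CATEGORIES = {
    "inch": "dimension", "inches": "dimension", "cm": "dimension",
    "litre": "volume", "litres": "volume", "ml": "volume",
    "kg": "weight", "kgs": "weight",
}

# Matches single tokens such as "10-inch" that the tokenizer keeps together.
HYPHENATED = re.compile(r"^(\d+(?:\.\d+)?)-([A-Za-z]+)$")


def extract_measurements(tagged):
    """Map each recognised unit category to its number-plus-unit text."""
    found = {}
    for i, (word, tag) in enumerate(tagged):
        match = HYPHENATED.match(word)
        if match and match.group(2).lower() in UNIT_CATEGORIES:
            # Number and unit arrived as one token, e.g. "10-inch".
            found[UNIT_CATEGORIES[match.group(2).lower()]] = word
        elif tag == "CD" and i + 1 < len(tagged):
            # A cardinal number followed by a known unit word.
            unit = tagged[i + 1][0]
            if unit.lower() in UNIT_CATEGORIES:
                found[UNIT_CATEGORIES[unit.lower()]] = f"{word} {unit}"
    return found


# The tagged output shown in the question, hard-coded so no NLTK data is needed.
tagged = [('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'),
          ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
print(extract_measurements(tagged))  # {'volume': '5 litres'}
```

If a text mentions two values of the same category, the later one overwrites the earlier one here; storing a list per category would avoid that.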

You can also create a corpus and train a chunker.
