Reputation: 36307
I'm experimenting with NLTK to help me parse some text. As an example I have:
1 Robins Drive owned by Gregg S. Smith was sold to TeStER, LLC of 494 Bridge Avenue, Suite 101-308, Sheltville AZ 02997 for $27,000.00.
using:
words = pos_tag(word_tokenize(sentence))
I get:
[('1', 'CD'), ('Robins', 'NNP'), ('Drive', 'NNP'), ('owned', 'VBN'), ('by', 'IN'), ('Gregg', 'NNP'), ('S.', 'NNP'), ('Smith', 'NNP'), ('was', 'VBD'), ('sold', 'VBN'), ('to', 'TO'), ('TeStER', 'NNP'), (',', ','), ('LLC', 'NNP'), ('of', 'IN'), ('494', 'CD'), ('Bridge', 'NNP'), ('Avenue', 'NNP'), (',', ','), ('Suite', 'NNP'), ('101-308', 'CD'), (',', ','), ('Sheltville', 'NNP'), ('AZ', 'NNP'), ('02997', 'CD'), ('for', 'IN'), ('$', '$'), ('27,000.00', 'CD'), ('.', '.')]
Assuming I want to extract the role of 'owner' (Gregg S. Smith), is there a way to mix and match literals and tags, perhaps in a format something like:
'owned by{<NP>+}'
There was a previous discussion of this at "Mixing words and PoS tags in NLTK parser grammars", but I'm not sure I understood the provided answer. Is this possible, and if so, could you provide a code example?
Upvotes: 0
Views: 462
Reputation: 3154
If you combine each word with its tag and then use a regular expression to look for certain sequences of PoS tags, you can get the results you are looking for.
For example, using the words variable you have defined,
joined = ' '.join([w+"<"+t+">" for w,t in words])
would produce
'1<CD> Robins<NNP> Drive<NNP> owned<VBN> by<IN> Gregg<NNP> S.<NNP> Smith<NNP> was<VBD> sold<VBN> to<TO> TeStER<NNP> ,<,> LLC<NNP> of<IN> 494<CD> Bridge<NNP> Avenue<NNP> ,<,> Suite<NNP> 101-308<CD> ,<,> Sheltville<NNP> AZ<NNP> 02997<CD> for<IN> $<$> 27,000.00<CD> .<.>'
Then you have to create a regular expression to find the sequence you are looking for depending on word/tag context.
For example, using Python's regular expression module, re:
>>> import re
>>> m = re.match(r'.*owned<VBN> by<IN>.*?<NNP>', joined)
>>> m.group(0)
'1<CD> Robins<NNP> Drive<NNP> owned<VBN> by<IN> Gregg<NNP>'
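Note that the match above stops at the first NNP token and includes everything from the start of the sentence. If you want the whole name, one option is a capture group over consecutive word<NNP> tokens that follow the literal words. The following is only a sketch: the tagged list is copied by hand from the question (shortened), and the names pattern and owner are just illustrative.

```python
import re

# Tagged tokens copied from the question's pos_tag output (shortened here).
words = [('1', 'CD'), ('Robins', 'NNP'), ('Drive', 'NNP'), ('owned', 'VBN'),
         ('by', 'IN'), ('Gregg', 'NNP'), ('S.', 'NNP'), ('Smith', 'NNP'),
         ('was', 'VBD'), ('sold', 'VBN'), ('to', 'TO'), ('TeStER', 'NNP')]

# Same word<TAG> encoding as above.
joined = ' '.join(w + "<" + t + ">" for w, t in words)

# Literal "owned by", then capture one or more consecutive word<NNP> tokens.
pattern = re.compile(r'owned<VBN> by<IN> ((?:\S+<NNP>\s*)+)')

m = pattern.search(joined)
owner = None
if m:
    # Remove the <TAG> suffixes to recover the plain words.
    owner = re.sub(r'<[^>]+>', '', m.group(1)).strip()

print(owner)  # Gregg S. Smith
```

Using re.search instead of re.match with a leading .* keeps the pattern anchored on the literal words only. This assumes the name is tagged entirely as NNP; names containing other tags (for example a comma before "Jr.") would need a wider pattern.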
Upvotes: 2