user1592380

Reputation: 36307

Is it possible to mix literal words and tags in an NLTK regex?

I'm experimenting with NLTK to help me parse some text. As an example I have:

1 Robins Drive owned by Gregg S. Smith was sold to TeStER, LLC of 494 Bridge Avenue, Suite 101-308, Sheltville AZ 02997 for $27,000.00.

Using:

from nltk import pos_tag, word_tokenize
words = pos_tag(word_tokenize(sentence))

I get:

[('1', 'CD'), ('Robins', 'NNP'), ('Drive', 'NNP'), ('owned', 'VBN'), ('by', 'IN'), ('Gregg', 'NNP'), ('S.', 'NNP'), ('Smith', 'NNP'), ('was', 'VBD'), ('sold', 'VBN'), ('to', 'TO'), ('TeStER', 'NNP'), (',', ','), ('LLC', 'NNP'), ('of', 'IN'), ('494', 'CD'), ('Bridge', 'NNP'), ('Avenue', 'NNP'), (',', ','), ('Suite', 'NNP'), ('101-308', 'CD'), (',', ','), ('Sheltville', 'NNP'), ('AZ', 'NNP'), ('02997', 'CD'), ('for', 'IN'), ('$', '$'), ('27,000.00', 'CD'), ('.', '.')]

Assuming I want to extract the role of 'owner' (Gregg S. Smith), is there a way to mix and match literals and tags, perhaps in a format something like:

'owned by{<NP>+}'

There was a previous discussion of this at "Mixing words and PoS tags in NLTK parser grammars", but I'm not sure I understood the provided answer. Is this possible, and if so, could you provide a code example?

Upvotes: 0

Views: 462

Answers (1)

nmlq

Reputation: 3154

If you combine each word with its tag into a single string, you can then use a regular expression to look for particular sequences of words and PoS tags and get the results you are looking for.

For example, using the words variable you have defined:

joined = ' '.join([w+"<"+t+">" for w,t in words])

would produce:

'1<CD> Robins<NNP> Drive<NNP> owned<VBN> by<IN> Gregg<NNP> S.<NNP> Smith<NNP> was<VBD> sold<VBN> to<TO> TeStER<NNP> ,<,> LLC<NNP> of<IN> 494<CD> Bridge<NNP> Avenue<NNP> ,<,> Suite<NNP> 101-308<CD> ,<,> Sheltville<NNP> AZ<NNP> 02997<CD> for<IN> $<$> 27,000.00<CD> .<.>'

Then you have to create a regular expression to find the sequence you are looking for depending on word/tag context.

For example, using Python's regular expression module re:

>>> import re
>>> m = re.match(r'.*owned<VBN> by<IN>.*?<NNP>', joined)
>>> m.group(0)
'1<CD> Robins<NNP> Drive<NNP> owned<VBN> by<IN> Gregg<NNP>'
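
Note that the match above runs from the start of the string and stops at the first NNP token, so it returns 'Gregg' plus everything before it, not the full owner name. One possible refinement is sketched below. It uses a capture group for the run of consecutive NNP tokens that follows the literal words, then strips the tags. The function name extract_owner and the shortened token list are illustrative assumptions, not part of NLTK or the original answer.

```python
import re

# Tagged tokens copied from the start of the pos_tag output in the question.
words = [('1', 'CD'), ('Robins', 'NNP'), ('Drive', 'NNP'), ('owned', 'VBN'),
         ('by', 'IN'), ('Gregg', 'NNP'), ('S.', 'NNP'), ('Smith', 'NNP'),
         ('was', 'VBD'), ('sold', 'VBN'), ('to', 'TO'), ('TeStER', 'NNP')]

# Same word<TAG> encoding as in the answer above.
joined = ' '.join(w + "<" + t + ">" for w, t in words)

# Literal words "owned by", followed by one or more NNP-tagged tokens (captured).
OWNER_PATTERN = re.compile(r'owned<VBN> by<IN>((?: \S+<NNP>)+)')


def extract_owner(tagged_text):
    """Hypothetical helper: return the NNP run after 'owned by', or None."""
    m = OWNER_PATTERN.search(tagged_text)
    if m is None:
        return None
    # Drop the <TAG> suffixes to recover the plain words.
    return re.sub(r'<[^>]+>', '', m.group(1)).strip()


owner = extract_owner(joined)
print(owner)
```

Using re.search instead of re.match with a leading .* keeps the text before 'owned by' out of the result. This sketch assumes tokens contain no whitespace and that the owner is tagged entirely as NNP; names that the tagger labels differently would need a wider tag alternation.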

Upvotes: 2
