Is it possible to mix literal words and tags in NLTK regex?

Question

I'm experimenting with NLTK to help me parse some text. As an example I have:

1 Robins Drive owned by Gregg S. Smith was sold to TeStER, LLC of 494 Bridge Avenue, Suite 101-308, Sheltville AZ 02997 for $27,000.00.

Using:

words = pos_tag(word_tokenize(sentence))

I get:

[('1', 'CD'), ('Robins', 'NNP'), ('Drive', 'NNP'), ('owned', 'VBN'), ('by', 'IN'), ('Gregg', 'NNP'), ('S.', 'NNP'), ('Smith', 'NNP'), ('was', 'VBD'), ('sold', 'VBN'), ('to', 'TO'), ('TeStER', 'NNP'), (',', ','), ('LLC', 'NNP'), ('of', 'IN'), ('494', 'CD'), ('Bridge', 'NNP'), ('Avenue', 'NNP'), (',', ','), ('Suite', 'NNP'), ('101-308', 'CD'), (',', ','), ('Sheltville', 'NNP'), ('AZ', 'NNP'), ('02997', 'CD'), ('for', 'IN'), ('$', '$'), ('27,000.00', 'CD'), ('.', '.')]

Assuming I want to extract the role of 'owner' (Gregg S. Smith), is there a way to mix and match literals and tags, perhaps in a format something like:

'owned by {<NNP>+}'

There was a previous discussion of this at "Mixing words and PoS tags in NLTK parser grammars", but I'm not sure I understood the provided answer. Is this possible, and if so, could you provide a code example?

nmlq · Accepted Answer

If you combine each word with its tag into a single string, you can then use a regular expression to look for particular sequences of literal words and PoS tags and get the results you are looking for.

For example, using the words variable you have defined:

joined = ' '.join([w+"<"+t+">" for w,t in words])

would produce

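'1<CD> Robins<NNP> Drive<NNP> owned<VBN> by<IN> Gregg<NNP> S.<NNP> Smith<NNP> was<VBD> sold<VBN> to<TO> TeStER<NNP> ,<,> LLC<NNP> of<IN> 494<CD> Bridge<NNP> Avenue<NNP> ,<,> Suite<NNP> 101-308<CD> ,<,> Sheltville<NNP> AZ<NNP> 02997<CD> for<IN> $<$> 27,000.00<CD> .<.>'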

Then you have to create a regular expression to find the sequence you are looking for depending on word/tag context.

For example, using Python's regular expression module re:

>>> import re
>>> m = re.match(r'.*owned<VBN> by<IN>.*?<NNP>', joined)
>>> m.group(0)
'1<CD> Robins<NNP> Drive<NNP> owned<VBN> by<IN> Gregg<NNP>'
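
The match above stops at the first proper noun after "owned by". To pull out the whole name, one option is to capture a run of consecutive NNP tokens and then strip the tags. The following is only a sketch under a few assumptions: the tagged list is hard-coded from the question's output (shortened) so it runs without NLTK, the names pattern and owner are made up for illustration, and it assumes no token contains a literal "<".

```python
import re

# Tagged tokens copied from the question's pos_tag output (shortened).
words = [('1', 'CD'), ('Robins', 'NNP'), ('Drive', 'NNP'), ('owned', 'VBN'),
         ('by', 'IN'), ('Gregg', 'NNP'), ('S.', 'NNP'), ('Smith', 'NNP'),
         ('was', 'VBD'), ('sold', 'VBN'), ('to', 'TO')]

# Same word<TAG> encoding as in the answer.
joined = ' '.join(w + "<" + t + ">" for w, t in words)

# Literal words "owned by" (with their tags), followed by one or more
# word<NNP> tokens. The capture group holds the whole NNP run.
pattern = re.compile(r'owned<VBN> by<IN> ((?:[^\s<]+<NNP>\s?)+)')

m = pattern.search(joined)
# Remove the tag markers to recover the plain words.
owner = m.group(1).replace('<NNP>', '').strip() if m else None
print(owner)  # Gregg S. Smith
```

Using search rather than match avoids the leading '.*', and the capture group returns only the name instead of everything up to it. If the tagger labels part of a name with something other than NNP, the run ends early, so the pattern may need widening for real data.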
