user1592380

Reputation: 36267

Extracting between tokens with NLTK

I'm experimenting with NLTK to help me parse some text. So far, just using the sent_tokenize function has been very helpful in organizing the text. As an example, I have this sentence:

1 Robins Drive owned by Gregg S. Smith was sold to TeStER, LLC of 494 Bridge Avenue, Suite 101-308, Sheltville AZ 02997 for $27,000.00.

Using:

words = pos_tag(word_tokenize(sentence))

I get:

[('1', 'CD'), ('Robins', 'NNP'), ('Drive', 'NNP'), ('owned', 'VBN'), ('by', 'IN'), ('Gregg', 'NNP'), ('S.', 'NNP'), ('Smith', 'NNP'), ('was', 'VBD'), ('sold', 'VBN'), ('to', 'TO'), ('TeStER', 'NNP'), (',', ','), ('LLC', 'NNP'), ('of', 'IN'), ('494', 'CD'), ('Bridge', 'NNP'), ('Avenue', 'NNP'), (',', ','), ('Suite', 'NNP'), ('101-308', 'CD'), (',', ','), ('Sheltville', 'NNP'), ('AZ', 'NNP'), ('02997', 'CD'), ('for', 'IN'), ('$', '$'), ('27,000.00', 'CD'), ('.', '.')]

I have been looking at various tutorials and the book http://www.nltk.org/book/, but I'm not sure of the best approach to extracting the tokens between two other tokens. For example, I want to select the tokens between "owned by" and "was sold to" to get the owner's name. How can I best use NLTK functions and Python to do this?

Upvotes: 3

Views: 590

Answers (1)

estebanpdl

Reputation: 1233

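One way to select the text between the two phrases you mentioned is to use a RegexpTokenizer whose pattern captures everything between them. Note that it works on the raw sentence string rather than on the tagged tokens: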

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'owned by(.*?)was sold to')

string = '1 Robins Drive owned by Gregg S. Smith was sold to TeStER, LLC of 494 Bridge Avenue, Suite 101-308, Sheltville AZ 02997 for $27,000.00.'
s = tokenizer.tokenize(string)

s is then (note the surrounding spaces, which you can remove with strip()):

[' Gregg S. Smith ']
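If you would rather work on the tagged tokens from pos_tag than on the raw string, a minimal sketch in plain Python is shown here. The helper names find_sublist and tokens_between are hypothetical, not part of NLTK, and the list is just the first part of the output shown in the question. It assumes each marker phrase appears as a contiguous run of tokens and returns an empty list when either one is missing.

```python
def find_sublist(tokens, marker, start=0):
    """Return the index where `marker` first occurs in `tokens` at or after `start`, or -1."""
    n = len(marker)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == marker:
            return i
    return -1


def tokens_between(tagged, left, right):
    """Return the (word, tag) pairs strictly between the `left` and `right` marker sequences."""
    words = [word for word, _ in tagged]
    begin = find_sublist(words, left)
    if begin == -1:
        return []
    begin += len(left)  # skip past the left marker itself
    end = find_sublist(words, right, begin)
    if end == -1:
        return []
    return tagged[begin:end]


# First part of the pos_tag output from the question.
tagged = [('1', 'CD'), ('Robins', 'NNP'), ('Drive', 'NNP'), ('owned', 'VBN'),
          ('by', 'IN'), ('Gregg', 'NNP'), ('S.', 'NNP'), ('Smith', 'NNP'),
          ('was', 'VBD'), ('sold', 'VBN'), ('to', 'TO'), ('TeStER', 'NNP')]

owner = tokens_between(tagged, ['owned', 'by'], ['was', 'sold', 'to'])
owner_name = ' '.join(word for word, _ in owner)
print(owner_name)  # Gregg S. Smith
```

Because the POS tags are kept, you can filter the result further, for example keeping only the NNP tokens if other words might appear between the markers.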

Upvotes: 4
