Reputation: 675
There are a bunch of questions that get at extracting a particular sentence that contains a word (like "extract a sentence using python" and "Python extract sentence containing word"), and I have enough beginner experience with NLTK and SciPy to be able to do that on my own.
However, I'm getting stuck trying to extract a sentence containing a word... as well as the sentences before and after the target sentence.
For example:
"I was walking along to school the other day, when it began to rain. I reached for my umbrella, but I realized I had forgotten it at home. What could I do? I immediately ran for the nearest tree. But then I realized I couldn't stay dry with a tree without any leaves."
In this example, the target word is "could." If I wanted to extract the target sentence (What could I do?) as well as the preceding and following sentences (I reached for my umbrella, but I realized I had forgotten it at home. and I immediately ran for the nearest tree.), what would be a good approach?
Assume I have each paragraph sectioned off as its own text...
for paragraph in document:
    do something
... is there a proper way to tackle this question? I have about 10,000 paragraphs with varying numbers of sentences around the target word (which appears in every single paragraph).
Upvotes: 2
Views: 6819
Reputation: 24139
What about something like this?
import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

for paragraph in document:
    paragraph_sentence_list = tokenizer.tokenize(paragraph)
    for line in range(len(paragraph_sentence_list)):
        if 'could' in paragraph_sentence_list[line]:
            # Index -1 would silently wrap around to the last sentence
            # instead of raising IndexError, so check the bounds explicitly.
            if line > 0:
                print(paragraph_sentence_list[line-1])
            else:
                print('Edge of paragraph. Beginning.')
            print(paragraph_sentence_list[line])
            if line + 1 < len(paragraph_sentence_list):
                print(paragraph_sentence_list[line+1])
            else:
                print('Edge of paragraph. End.')
What this does is break each paragraph into a list of sentences.
Iterating over the sentences then tests whether 'could' is in the sentence. If it is, it prints the previous index [line-1], the current index [line] and the next index [line+1].
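If you would rather collect the results than print them, the same idea can be wrapped in a small function. This is only a rough sketch: the name sentences_with_context and the window parameter are made up for illustration, and the regex splitter is a naive stand-in so the example runs without NLTK (for real text you would pass in tokenizer.tokenize(paragraph) instead).

```python
import re

def sentences_with_context(sentences, target, window=1):
    """Return one list per match: the matching sentence plus up to
    `window` sentences on each side, clipped at the paragraph edges."""
    # \b stops the target from matching inside a longer word;
    # note that it still matches the 'could' in "couldn't".
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    results = []
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            start = max(i - window, 0)  # never go below index 0
            # Slicing past the end of a list is safe, so no upper check is needed.
            results.append(sentences[start:i + window + 1])
    return results

# Naive splitter, used only to keep this example self-contained.
paragraph = ("I reached for my umbrella, but I realized I had forgotten it at home. "
             "What could I do? I immediately ran for the nearest tree.")
sentences = re.split(r"(?<=[.?!])\s+", paragraph)

for context in sentences_with_context(sentences, "could"):
    print(" ".join(context))
```

Slicing is what handles the edges here: a match in the first or last sentence just gives a shorter list rather than wrapping around or raising an error.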
Upvotes: 4
Reputation: 122270
Make use of sent_tokenize to extract sentences from the raw corpus, then word_tokenize to tokenize each sentence, and keep the sentences that contain "could":
>>> from nltk.corpus import brown
>>> from nltk import sent_tokenize, word_tokenize
>>> corpus = " ".join(brown.words())
>>> [i for i in sent_tokenize(corpus) if u"could" in word_tokenize(i)]
To get the sentence before and after:
>>> sentences = sent_tokenize(corpus)
>>> # Slicing clips at the ends, so the first and last sentences neither wrap around nor raise IndexError.
>>> [" ".join(sentences[max(i-1, 0):i+2]) for i, j in enumerate(sentences) if u"could" in word_tokenize(j)]
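If you need the three sentences kept separate rather than joined into one string, one option is to pad the list and zip it against itself. The sketch below is just an illustration under a few assumptions: context_triples is a made-up helper name, and the regex tokenizer is a simplified stand-in for word_tokenize so that the snippet runs without any NLTK data. That regex keeps a contraction such as "couldn't" as a single token, so it is not counted as a match for "could", unlike a plain substring test.

```python
import re

def context_triples(sentences, target):
    """Yield (previous, current, next) for every sentence that contains
    `target` as a whole token; previous/next are None at the edges."""
    previous_sentences = [None] + sentences[:-1]
    next_sentences = sentences[1:] + [None]
    for prev, cur, nxt in zip(previous_sentences, sentences, next_sentences):
        # Simplified stand-in for word_tokenize: lowercase runs of letters
        # and apostrophes, so "couldn't" stays one token.
        tokens = re.findall(r"[a-z']+", cur.lower())
        if target in tokens:
            yield prev, cur, nxt

sentences = ["What could I do?", "I ran for a tree.", "I couldn't stay dry."]
for prev, cur, nxt in context_triples(sentences, "could"):
    print(prev, "|", cur, "|", nxt)
```

Getting None back at the edges makes it easy to tell "no neighbouring sentence" apart from an empty one when you process all 10,000 paragraphs.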
Upvotes: 4