Reputation: 752
Is there a way (Pattern or Python or NLTK, etc) to detect of a sentence has a list of words in it.
i.e.
The cat ran into the hat, box, and house.
| The list would be hat, box, and house
This could be string processed but we may have more generic lists:
i.e.
The cat likes to run outside, run inside, or jump up the stairs.
|
List=run outside, run inside, or jump up the stairs.
This could be in the middle of a paragraph or the end of the sentence which further complicates things.
I've been working with Pattern for python for awhile and I'm not seeing a way to go about this and was curious if there is a way with pattern or nltk (natural language tool kit).
Upvotes: 1
Views: 12230
Reputation: 1621
Using a Trie, you will be able to achieve this is O(n)
where n
is the number of words in the list of words after building a trie with the list of words which takes O(n)
where n
is the number of words in the list.
Algorithm
Code
import pygtrie
listOfWords = ['word1', 'word2', 'word3', 'two words']
trie = pygtrie.StringTrie()
trie._separator = ' '
for word in listOfWords:
trie[word] = True
print('s', trie._separator)
sentence = "word1 as word2 a fword3 af two words"
sentence_words = sentence.split()
words_found = {}
extended_words = set()
for possible_word in sentence_words:
has_possible_word = trie.has_node(possible_word)
if has_possible_word & trie.HAS_VALUE:
words_found[possible_word] = True
deep_clone = set(extended_words)
for extended_word in deep_clone:
extended_words.remove(extended_word)
possible_extended_word = extended_word + trie._separator + possible_word
print(possible_extended_word)
has_possible_extended_word = trie.has_node(possible_extended_word)
if has_possible_extended_word & trie.HAS_VALUE:
words_found[possible_extended_word] = True
if has_possible_extended_word & trie.HAS_SUBTRIE:
extended_words.update(possible_extended_word)
if has_possible_word & trie.HAS_SUBTRIE:
extended_words.update([possible_word])
print(words_found)
print(len(words_found) == len(listOfWords))
This is useful if your list of words is huge and you do not wish to iterate over it every time or you have a large number of queries that over the same list of words.
Upvotes: 1
Reputation: 8972
What about using from nltk.tokenize import sent_tokenize
?
sent_tokenize("Hello SF Python. This is NLTK.")
["Hello SF Python.", "This is NLTK."]
Then you can use that list of sentences in this way:
for sentence in my_list:
# test if this sentence contains the words you want
# using all() method
More info here
Upvotes: 0
Reputation: 213193
From what I got from your question, I think you want to search whether all the words in your list is present in a sentence or not.
In general to search for a list elements, in a sentence, you can use all
function. It returns true, if all the arguments in it are true.
listOfWords = ['word1', 'word2', 'word3', 'two words']
sentence = "word1 as word2 a fword3 af two words"
if all(word in sentence for word in listOfWords):
print "All words in sentence"
else:
print "Missing"
OUTPUT: -
"All words in sentence"
I think this might serve your purpose. If not, then you can clarify.
Upvotes: 2