Reputation: 55
My problem has two parts; here is the first:
I have a list of lists like this:
big_list = [['i have an apple','i have a pear','i am a monkey', 'tell me about apples'],
['tell me about cars','tell me about trucks','tell me about planes']]
I also have a list of words like this:
words = ['i','have','monkey','tell','me','about']
I want to iterate through big_list, checking whether each sublist contains more than one element from words in sequence. For example, big_list[0] would contain 'i have' from the first and second elements and 'tell me about' in the final element.
Currently I am trying this at the level of the sublist, where I first tokenize all strings in a sublist so that I can iterate through their tokens to see where elements from words occur:
import nltk
example = big_list[0]
example_sentences_tokens = []
for sentence in example:
    example_sentences_tokens.append([token.lower() for token in nltk.tokenize.word_tokenize(sentence)])
Having access to both original strings and tokenized strings, I check where elements from words occur:
tuples = []
for sentence, tokenized_sentence in zip(example, example_sentences_tokens):
    tuples.append(tuple((sentence, [token for token in example_sentences_tokens if token in words])))
Now, tuples is a list of tuples each containing every sentence from big_list[0] and all elements of said sentence which exist in words.
However, I only want to include tokens existing in words if they occur in sequence, not if they occur alone. How can I do this?
The second part of the problem: once I've identified all instances where a sequence of elements from words appears together somewhere in big_list, I'd like to show the frequency of those sequences across all sublists. So 'tell me about' occurs in 100% of big_list[1] and 25% of big_list[0]. Is there a simple way to show this distribution?
Upvotes: 0
Views: 402
Reputation: 2013
First of all, when testing your code I had to change the tuples line so it actually collects the common elements between words and tokenized_sentence (otherwise all I got was tuples like (sentence, [])):

tuples.append((sentence, [token for token in words if token in tokenized_sentence]))
To check that we have two or more "matches" in sequence, the solution depends on whether the order of the tokens in words matters. For example, if words = ['i','have','monkey','tell','about','me'] (with 'about' before 'me'), would 'tell me about apples' still match? My guess is that it would, but I will provide solutions for both cases.
In the case where the order of tokens in words matters, you can simply check whether the matched tokens, joined by a space, appear in the examined sentence:
tuples = []
for sentence, tokenized_sentence in zip(example, example_sentences_tokens):
    matches = [token for token in words if token in tokenized_sentence]
    sequence = ' '.join(matches)  # order of matches matters here
    if sequence in sentence:
        tuples.append((sentence, matches))
print(tuples)
Output:
[('i have an apple', ['i', 'have']),
 ('i have a pear', ['i', 'have']),
 ('tell me about apples', ['tell', 'me', 'about'])]
In the case where the order of tokens in words doesn't matter, you can take the index of the first matching token and check whether the next token in the tokenized sentence is also a match:
tuples = []
for sentence, tokenized_sentence in zip(example, example_sentences_tokens):
    matches = [token for token in words if token in tokenized_sentence]
    if not matches:  # skip sentences with no matching tokens at all
        continue
    i = tokenized_sentence.index(matches[0])
    # guard against the first match being the sentence's last token
    if i + 1 < len(tokenized_sentence) and tokenized_sentence[i + 1] in matches:
        tuples.append((sentence, matches))
print(tuples)
Output:
[('i have an apple', ['i', 'have']),
('i have a pear', ['i', 'have']),
('tell me about apples', ['tell', 'about', 'me'])]
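If you need the matched sequence itself (not just the list of matching tokens), here is a minimal alternative sketch: it scans a tokenized sentence for consecutive runs of tokens from words, keeping only runs of two or more. The helper name find_runs is my own, not from your code:

```python
# Sketch of an alternative (helper name `find_runs` is mine): scan a
# tokenized sentence for consecutive runs of tokens from `words`,
# keeping only runs of length >= 2, and return the runs themselves.
def find_runs(tokenized_sentence, words):
    word_set = set(words)  # order no longer matters; fast membership tests
    runs, current = [], []
    for token in tokenized_sentence:
        if token in word_set:
            current.append(token)
        else:
            if len(current) >= 2:
                runs.append(current)
            current = []
    if len(current) >= 2:  # a run can end on the sentence's last token
        runs.append(current)
    return runs

words = ['i', 'have', 'monkey', 'tell', 'me', 'about']
print(find_runs(['i', 'have', 'an', 'apple'], words))       # [['i', 'have']]
print(find_runs(['tell', 'me', 'about', 'apples'], words))  # [['tell', 'me', 'about']]
```

This also handles a sentence containing several separate runs, which the first-match-index check above cannot.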
I imagine you will apply the above procedure to each set of sentences in big_list.
What I suggest is to keep a list of the results in tuples at each round, along with the index of the list of sentences in big_list being examined: this way you can keep track of all match combinations and work your way towards computing the percentages of occurrences based on the index.
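For the second part of the question, here is a sketch of that suggestion (the run-collecting logic is my own, and I assume a simple whitespace tokenizer; nltk.word_tokenize would work just as well): collect every run of two or more consecutive words tokens per sentence, count in how many sentences of each sublist it appears, and convert that count to a percentage:

```python
from collections import Counter

big_list = [['i have an apple', 'i have a pear', 'i am a monkey', 'tell me about apples'],
            ['tell me about cars', 'tell me about trucks', 'tell me about planes']]
words = {'i', 'have', 'monkey', 'tell', 'me', 'about'}

percentages = {}  # (sublist index, run) -> % of sentences containing the run
for idx, sublist in enumerate(big_list):
    counts = Counter()
    for sentence in sublist:
        tokens = sentence.lower().split()  # nltk.word_tokenize would work too
        runs, current = set(), []
        for token in tokens + ['']:  # trailing sentinel flushes the last run
            if token in words:
                current.append(token)
            else:
                if len(current) >= 2:  # only sequences, not lone tokens
                    runs.add(' '.join(current))
                current = []
        counts.update(runs)  # a set, so each run counts once per sentence
    for run, n in counts.items():
        percentages[(idx, run)] = 100 * n / len(sublist)

print(percentages)
# {(0, 'i have'): 50.0, (0, 'tell me about'): 25.0, (1, 'tell me about'): 100.0}
```

Note that a run appearing twice in one sentence is still counted once, since the question asks what fraction of sentences contain it.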
Upvotes: 1