Reputation: 299
I have a list of tuples that looks like this:
[(('review', 'shop', 'draw'), 35),
(('shop', 'drawing', 'review'), 32),
(('field', 'review', 'report'), 24),
(('review', 'shop', 'drawing'), 20),
(('shop', 'drawing', 'return'), 20),
(('shop', 'draw', 'review'), 18),
(('site', 'review', 'report'), 17),
(('respond', 'rfi', 'regard'), 15),
(('review', 'fire', 'alarm'), 11),
(('review', 'lighting', 'shop'), 10)]
I would like to stem the words and then merge the elements that become similar, adding up their counts:
from nltk.stem import PorterStemmer

porter = PorterStemmer()  # create the stemmer once, before the loop

for elm in trigram_counts:
    ngrams = list(elm[0])
    stemmed_ngrams = []
    for gram in ngrams:
        stemmed_ngrams.append(porter.stem(gram))
    print(stemmed_ngrams, elm[1])
This gives something like this:
['review', 'shop', 'draw'] 35
['shop', 'draw', 'review'] 32
['field', 'review', 'report'] 24
['review', 'shop', 'draw'] 20
['shop', 'draw', 'return'] 20
['shop', 'draw', 'review'] 18
['site', 'review', 'report'] 17
['respond', 'rfi', 'regard'] 15
['review', 'fire', 'alarm'] 11
['review', 'light', 'shop'] 10
My goal is to merge, for example, ['review', 'shop', 'draw'] and ['shop', 'draw', 'review'] into a single entry whose count is their sum, which is 67.
I think I'm overcomplicating my solution by looping through all the elements more often than necessary.
Upvotes: 1
Views: 154
Reputation: 1039
Since you want to combine the counts of trigrams that contain the same stemmed words, you can use a dictionary with frozensets as keys: each key is a stemmed trigram and each value is its total count.
You have to use frozensets instead of sets because dict keys must be hashable, and ordinary sets are not.
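To illustrate the hashability point, here is a minimal sketch using only built-in types. The names `counts` and `error` are made up for this example, and the words are taken from the stemmed output in the question:

```python
counts = {}

# A frozenset is hashable, so it can be used as a dict key.
counts[frozenset({"review", "shop", "draw"})] = 35

# The same words in a different order build an equal frozenset,
# so this update lands on the same key.
counts[frozenset(["shop", "draw", "review"])] += 32

# A plain set is mutable and therefore unhashable: using it as a key fails.
try:
    counts[{"review", "shop", "draw"}] = 1
except TypeError as exc:
    error = str(exc)
```

After this runs, `counts` holds a single key with the value 67, and `error` contains the TypeError message raised for the plain set.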
The full solution looks something like this:
from collections import defaultdict
from nltk.stem import PorterStemmer

# defaultdict(int) starts missing keys at 0, so there is no need to check whether a key exists
stemmed_trigram_counts = defaultdict(int)
porter = PorterStemmer()

for trigram, count in trigram_counts:
    # stem each word; the frozenset ignores word order
    stemmed_trigram = frozenset(porter.stem(word) for word in trigram)
    stemmed_trigram_counts[stemmed_trigram] += count

print(stemmed_trigram_counts)
This gives the following totals (the printed defaultdict is laid out here as a plain dict for readability):
{
frozenset({'draw', 'review', 'shop'}): 105,
frozenset({'field', 'report', 'review'}): 24,
frozenset({'draw', 'return', 'shop'}): 20,
frozenset({'report', 'review', 'site'}): 17,
frozenset({'regard', 'respond', 'rfi'}): 15,
frozenset({'alarm', 'fire', 'review'}): 11,
frozenset({'light', 'review', 'shop'}): 10
}
Remark: if the order of the words matters, you should use tuples instead of frozensets as keys.
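As a rough sketch of that order-sensitive variant, the example below uses tuples as keys. To keep it runnable without nltk, it assumes the trigrams are already stemmed and copies them from the printed output in the question. The names `stemmed_trigrams` and `ordered_counts` are made up for this illustration:

```python
from collections import defaultdict

# Already-stemmed trigrams, copied from the question's printed output.
stemmed_trigrams = [
    (("review", "shop", "draw"), 35),
    (("shop", "draw", "review"), 32),
    (("field", "review", "report"), 24),
    (("review", "shop", "draw"), 20),
    (("shop", "draw", "return"), 20),
    (("shop", "draw", "review"), 18),
    (("site", "review", "report"), 17),
    (("respond", "rfi", "regard"), 15),
    (("review", "fire", "alarm"), 11),
    (("review", "light", "shop"), 10),
]

ordered_counts = defaultdict(int)
for trigram, count in stemmed_trigrams:
    # A tuple keeps word order, so only identical sequences are merged.
    ordered_counts[tuple(trigram)] += count
```

With tuple keys, ('review', 'shop', 'draw') totals 55 and ('shop', 'draw', 'review') totals 50, instead of being merged into one entry of 105 as with frozensets.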
Upvotes: 2