Reputation: 2294
I have this sentence: 'Johnny Johnny yes papa', and I want to calculate the frequency of the next word for each word. To do this, I make the sentence circular:
sentence = 'Johnny Johnny yes papa'
sentence = sentence.split()
sentence.append(sentence[0])
Now the sentence is ['Johnny','Johnny','yes','papa','Johnny']
I create the bigrams in this way:
def to_bigrams(my_list):
    bigrams = [(my_list[i], my_list[i+1]) for i, element in enumerate(my_list) if i < len(my_list)-1]
    return bigrams
my_bigrams = to_bigrams(sentence)
And now my bigrams are: [('Johnny', 'Johnny'), ('Johnny', 'yes'), ('yes', 'papa'), ('papa', 'Johnny')]
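As a side note, the same pairs can be built without index arithmetic. The following is a minimal sketch, not part of the original question; the name to_bigrams_zip is hypothetical and only used for illustration:

```python
def to_bigrams_zip(words):
    # Pair each word with its successor. zip stops at the shorter input,
    # so the last word is never paired with a missing element.
    return list(zip(words, words[1:]))

sentence = 'Johnny Johnny yes papa'.split()
sentence.append(sentence[0])  # make the sentence circular
my_bigrams = to_bigrams_zip(sentence)
print(my_bigrams)
```

This should produce the same list of tuples as the index-based version above.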
Now, for example, Johnny has two outcomes, Johnny and yes; yes has only one outcome, which is papa; and papa has only one outcome, which is Johnny. So the expected dictionary is:
{'Johnny':['Johnny','yes'],'yes':['papa'],'papa':['Johnny']}
I have tried this:
my_freq_dict = {my_bigrams[i][0]:my_bigrams[i][j] for i,element in enumerate(my_bigrams) for j in range(len(my_bigrams))}
But I get this error: IndexError: tuple index out of range. There is something wrong with my logic. Could you please help me?
Upvotes: 1
Views: 92
Reputation: 9061
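You can use itertools.groupby. The bigrams have to be sorted by their first word beforehand, because groupby only groups consecutive items: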
from itertools import groupby
res = {key: [x[1] for x in group] for key, group in groupby(sorted(my_bigrams, key=lambda x: x[0]), key=lambda x: x[0])}
print(res)
Output:
{'Johnny': ['Johnny', 'yes'], 'papa': ['Johnny'], 'yes': ['papa']}
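To see why the sort matters, here is a small sketch on a hypothetical input (the circular bigrams of 'a b a c', not taken from the question) in which the same first word appears in non-adjacent positions. operator.itemgetter(0) is used here in place of the lambda; the two are interchangeable for this purpose.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical bigrams where the key 'a' is not contiguous.
pairs = [('a', 'b'), ('b', 'a'), ('a', 'c'), ('c', 'a')]
first = itemgetter(0)

# Without sorting, groupby starts a new group each time the key changes,
# so the second 'a' group overwrites the first one in the dict.
unsorted_res = {k: [p[1] for p in g] for k, g in groupby(pairs, key=first)}

# sorted() is stable, so the followers keep their original relative order.
sorted_res = {k: [p[1] for p in g] for k, g in groupby(sorted(pairs, key=first), key=first)}

print(unsorted_res)
print(sorted_res)
```

In the unsorted case the follower 'b' of 'a' is lost, while the sorted case keeps both followers.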
Upvotes: 1
Reputation: 29742
One way is to use dict.setdefault:
my_bigrams = [('Johnny', 'Johnny'), ('Johnny', 'yes'), ('yes', 'papa'), ('papa', 'Johnny')]
d = {}
for v1, v2 in my_bigrams:
    d.setdefault(v1, []).append(v2)
d
Output:
{'Johnny': ['Johnny', 'yes'], 'yes': ['papa'], 'papa': ['Johnny']}
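Since the question mentions frequencies, a variant of the same loop could count the followers instead of listing them. This is only a sketch under that reading of the question; the name next_word_counts is made up for illustration:

```python
from collections import Counter, defaultdict

my_bigrams = [('Johnny', 'Johnny'), ('Johnny', 'yes'), ('yes', 'papa'), ('papa', 'Johnny')]

# Map each word to a Counter of the words that follow it.
next_word_counts = defaultdict(Counter)
for v1, v2 in my_bigrams:
    next_word_counts[v1][v2] += 1

# The plain list of followers can still be recovered from the counts.
followers = {word: list(counts) for word, counts in next_word_counts.items()}
print(followers)
```

With a longer text, next_word_counts[word].most_common() would list the followers of a word from most to least frequent.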
Your attempt raises the error because you are using len(my_bigrams) instead of len(element): j runs from 0 to 3, but each bigram tuple has only two elements.
Fixing it, however, won't yield the expected output: some keys appear more than once, so each earlier value is overwritten by the latest one (which is how dict is meant to work).
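The overwriting can be checked with a short sketch that applies only the len(element) fix to the comprehension from the question; the variable name fixed is introduced here for illustration:

```python
my_bigrams = [('Johnny', 'Johnny'), ('Johnny', 'yes'), ('yes', 'papa'), ('papa', 'Johnny')]

# range(len(element)) keeps j within 0..1, so no IndexError is raised.
# Each key is assigned several times, and only the last assignment survives.
fixed = {my_bigrams[i][0]: my_bigrams[i][j]
         for i, element in enumerate(my_bigrams)
         for j in range(len(element))}
print(fixed)
```

Each key ends up with a single string rather than a list of followers, and 'Johnny' keeps only 'yes', which is why an accumulating approach such as setdefault is needed.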
Upvotes: 1