Alexis

Reputation: 2294

Frequency and next words for a word of a bigram list in python

I have this sentence: 'Johnny Johnny yes papa', and I want to calculate the frequency of the next word for each word. To do this, I first make the sentence circular by appending its first word to the end:

sentence = 'Johnny Johnny yes papa'
sentence = sentence.split()
sentence.append(sentence[0])

Now the sentence is ['Johnny','Johnny','yes','papa','Johnny']

I create the bigrams in this way:

def to_bigrams(my_list):
    # Pair each word with the word that follows it.
    return list(zip(my_list, my_list[1:]))

my_bigrams = to_bigrams(sentence)

And now my bigrams are: [('Johnny', 'Johnny'), ('Johnny', 'yes'), ('yes', 'papa'), ('papa', 'Johnny')]

For example, Johnny has two outcomes (Johnny and yes), yes has only one outcome (papa), and papa has only one outcome (Johnny), so the expected dictionary is:

{'Johnny':['Johnny','yes'],'yes':['papa'],'papa':['Johnny']}

I have tried this:

my_freq_dict = {my_bigrams[i][0]:my_bigrams[i][j] for i,element in enumerate(my_bigrams) for j in range(len(my_bigrams))}

But I get this error: IndexError: tuple index out of range. Something is wrong with my logic. Could you please help me?

Upvotes: 1

Views: 92

Answers (2)

deadshot

Reputation: 9061

You can use itertools.groupby

from itertools import groupby

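# my_bigrams is the list of (word, next_word) tuples from the question.
# groupby only groups adjacent items, so sort by the first word beforehand.
res = {key: [x[1] for x in group] for key, group in groupby(sorted(my_bigrams, key=lambda x: x[0]), key=lambda x: x[0])}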
print(res)

Output:

{'Johnny': ['Johnny', 'yes'], 'papa': ['Johnny'], 'yes': ['papa']}
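
A small sketch of why the sorted call matters here: groupby only merges adjacent items that share a key. The question's bigrams happen to be grouped already, so the example below uses a reordered copy (the name shuffled and the helper first are made up for illustration) to show what goes wrong without sorting.

```python
from itertools import groupby

# Hypothetical reordering of the question's bigrams, so equal first words are not adjacent.
shuffled = [('Johnny', 'Johnny'), ('yes', 'papa'), ('Johnny', 'yes'), ('papa', 'Johnny')]

def first(pair):
    # Grouping key: the first word of the bigram.
    return pair[0]

# Without sorting, 'Johnny' forms two separate groups and the second overwrites the first.
unsorted_res = {key: [nxt for _, nxt in group]
                for key, group in groupby(shuffled, key=first)}

# With sorting, all bigrams that share a first word land in a single group.
sorted_res = {key: [nxt for _, nxt in group]
              for key, group in groupby(sorted(shuffled, key=first), key=first)}

print(unsorted_res)
print(sorted_res)
```

Because sorted is stable, the next words within each group keep their original relative order.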

Upvotes: 1

Chris

Reputation: 29742

One way using dict.setdefault:

my_bigrams = [('Johnny', 'Johnny'), ('Johnny', 'yes'), ('yes', 'papa'), ('papa', 'Johnny')]

d = {}
for v1, v2 in my_bigrams:
    d.setdefault(v1, []).append(v2)
print(d)

Output:

{'Johnny': ['Johnny', 'yes'], 'yes': ['papa'], 'papa': ['Johnny']}
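
If actual counts are wanted, as the title suggests, the same grouping can be followed by a count per key. This is only a sketch, assuming a plain count of each next word is what "frequency" means here; the name freq is made up for illustration.

```python
from collections import Counter

my_bigrams = [('Johnny', 'Johnny'), ('Johnny', 'yes'), ('yes', 'papa'), ('papa', 'Johnny')]

# Group the next words per first word, as above.
d = {}
for v1, v2 in my_bigrams:
    d.setdefault(v1, []).append(v2)

# Count how often each next word follows a given word.
freq = {word: Counter(next_words) for word, next_words in d.items()}
print(freq['Johnny'])
```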

Your attempt raises the error because j runs over range(len(my_bigrams)), i.e. 0 to 3, while each bigram tuple has only two elements. It should be range(len(element)).

Fixing that, however, still won't yield the expected output: a dict holds one value per key, so keys that appear more than once are overwritten by the latest entry (which is what a dict is meant to do).
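
To illustrate, here is the comprehension from the question with only that one change applied (the name fixed is made up for illustration). Each key ends up with a single string rather than a list:

```python
my_bigrams = [('Johnny', 'Johnny'), ('Johnny', 'yes'), ('yes', 'papa'), ('papa', 'Johnny')]

# Same comprehension as in the question, but j now stays within each tuple.
fixed = {my_bigrams[i][0]: my_bigrams[i][j]
         for i, element in enumerate(my_bigrams)
         for j in range(len(element))}

# Later assignments to the same key replace earlier ones,
# so 'Johnny' keeps only its last value.
print(fixed)  # {'Johnny': 'yes', 'yes': 'papa', 'papa': 'Johnny'}
```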

Upvotes: 1
