Most common bigram distribution in a dataset

Question

I have a dataset with a text column of utterances:

**utterance**
Where Arizona state located?
how to find Arizona state
is the united stated is the biggest country around the world?
Arizona state borders
united stated borders

I would like to get a bigram keyword distribution output:

  Arizona state 3
  United stated 2

This code works for unigrams (single words): df['utterances'].str.split().explode().value_counts()

How can I do the same for bigrams?

Meti · Accepted Answer

If you don't need any preprocessing, such as lowercasing, a simple implementation looks like this:

import pandas as pd
from collections import Counter
from functools import reduce

df = pd.DataFrame({'utterances': [
    'Where Arizona state located?',
    'how to find Arizona state',
    'is the united stated is the biggest country around the world?',
    'Arizona state borders',
    'united stated borders',
]})
# Count adjacent word pairs per utterance; split() tokenizes on whitespace only.
df['bigrams'] = df['utterances'].apply(
    lambda item: Counter(zip(item.split(), item.split()[1:])))
# Merge the per-row counters into a single counter for the whole column.
total = reduce(lambda a, b: a + b, df['bigrams'].to_list())
total.most_common()

output:

[(('Arizona', 'state'), 3),
 (('is', 'the'), 2),
 (('united', 'stated'), 2),
 (('Where', 'Arizona'), 1),
 (('state', 'located?'), 1),
 (('how', 'to'), 1),
 (('to', 'find'), 1),
 (('find', 'Arizona'), 1),
 (('the', 'united'), 1),
 (('stated', 'is'), 1),
 (('the', 'biggest'), 1),
 (('biggest', 'country'), 1),
 (('country', 'around'), 1),
 (('around', 'the'), 1),
 (('the', 'world?'), 1),
 (('state', 'borders'), 1),
 (('stated', 'borders'), 1)]

If you want anything more sophisticated, such as lowercasing, it has to be done before the bigrams are counted. Extending this to trigrams and longer n-grams is easy once you see the pattern :)
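As a rough illustration of preprocessing before counting, here is a minimal sketch. The `tokenize` helper and its regex are my own assumptions, not part of the answer above: it lowercases the text and keeps only runs of letters, digits, and apostrophes, so `located?` becomes `located`. It works on a plain list here, but any iterable of strings such as `df['utterances']` should behave the same way.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical helper: lowercase, then keep runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


utterances = [
    'Where Arizona state located?',
    'how to find Arizona state',
    'is the united stated is the biggest country around the world?',
    'Arizona state borders',
    'united stated borders',
]

total = Counter()
for utterance in utterances:
    tokens = tokenize(utterance)
    # Pair each token with its successor to form the bigrams of this utterance.
    total.update(zip(tokens, tokens[1:]))

print(total.most_common(3))
```

With this normalization, `('arizona', 'state')` is still counted 3 times and `('united', 'stated')` twice, but trailing punctuation no longer creates separate tokens such as `located?` and `world?`.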

UPDATE

import pandas as pd
from collections import Counter
from functools import reduce

df = pd.DataFrame({'utterances': [
    'Where Arizona state located?',
    'how to find Arizona state',
    'is the united stated is the biggest country around the world?',
    'Arizona state borders',
    'united stated borders',
]})

# Zip the token list with shifted copies of itself: one shift gives bigrams, two give trigrams.
df['bigrams'] = df['utterances'].apply(
    lambda item: Counter(zip(item.split(), item.split()[1:])))
df['trigrams'] = df['utterances'].apply(
    lambda item: Counter(zip(item.split(), item.split()[1:], item.split()[2:])))

# Merge the per-row counters into totals for the whole column.
total_bigram = reduce(lambda a, b: a + b, df['bigrams'].to_list())
total_trigram = reduce(lambda a, b: a + b, df['trigrams'].to_list())
print(total_bigram.most_common())
print(total_trigram.most_common())
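The bigram and trigram lines above differ only in how many shifted copies of the token list are zipped together, so the pattern can be generalized. The following is one possible sketch; `ngram_counts` is a hypothetical helper name, and it assumes the same whitespace-only tokenization as the code above.

```python
from collections import Counter


def ngram_counts(texts, n):
    """Count whitespace-delimited n-grams across an iterable of strings."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # Zip n staggered views of the token list; each resulting tuple is one n-gram.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


utterances = [
    'Where Arizona state located?',
    'how to find Arizona state',
    'is the united stated is the biggest country around the world?',
    'Arizona state borders',
    'united stated borders',
]

bigrams = ngram_counts(utterances, 2)
trigrams = ngram_counts(utterances, 3)

# Print the most frequent bigrams as "word word count" lines.
for gram, count in bigrams.most_common(3):
    print(' '.join(gram), count)
```

Passing `df['utterances']` instead of the list should give the same counts, and an utterance shorter than `n` words simply contributes no n-grams.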
