Reputation: 73
I have a dataset with an utterances text column:
**utterance**
Where Arizona state located?
how to find Arizona state
is the united stated is the biggest country around the world?
Arizona state borders
united stated borders
I would like to get a bigram keyword distribution output:
Arizona state 3
United stated 2
This code works for unigrams (single words):
df['utterances'].str.split().explode().value_counts()
How can I do this for bigrams?
Upvotes: 0
Views: 292
Reputation: 2056
If you don't need any preprocessing, such as lowercasing, a simple implementation looks like this:
import pandas as pd
from collections import Counter
from functools import reduce
df = pd.DataFrame({'utterances':
['Where Arizona state located?', 'how to find Arizona state', 'is the united stated is the biggest country around the world?', 'Arizona state borders', 'united stated borders']
})
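# Count the bigrams of each utterance by pairing every token with its successor.
df['bigrams'] = df['utterances'].apply(lambda item: Counter(zip(item.split(), item.split()[1:])))
# Sum the per-row Counters into one overall Counter.
total = reduce(lambda a, b: a + b, df['bigrams'].to_list())
total.most_common()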
output:
[(('Arizona', 'state'), 3),
(('is', 'the'), 2),
(('united', 'stated'), 2),
(('Where', 'Arizona'), 1),
(('state', 'located?'), 1),
(('how', 'to'), 1),
(('to', 'find'), 1),
(('find', 'Arizona'), 1),
(('the', 'united'), 1),
(('stated', 'is'), 1),
(('the', 'biggest'), 1),
(('biggest', 'country'), 1),
(('country', 'around'), 1),
(('around', 'the'), 1),
(('the', 'world?'), 1),
(('state', 'borders'), 1),
(('stated', 'borders'), 1)]
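To get closer to the layout asked for in the question (`Arizona state 3`), the tuples can be joined into strings and filtered by count. This is only a sketch: the names `utterances` and `repeated` and the "more than once" threshold are my own assumptions, and it uses a plain loop instead of the DataFrame column so that it runs on its own.

```python
from collections import Counter

utterances = [
    'Where Arizona state located?',
    'how to find Arizona state',
    'is the united stated is the biggest country around the world?',
    'Arizona state borders',
    'united stated borders',
]

# Count bigrams across all utterances using plain whitespace tokenization.
total = Counter()
for text in utterances:
    tokens = text.split()
    total.update(zip(tokens, tokens[1:]))

# Keep bigrams seen more than once (an assumed threshold) and
# join each tuple into a single space-separated string.
repeated = {
    ' '.join(bigram): count
    for bigram, count in total.most_common()
    if count > 1
}

for phrase, count in repeated.items():
    print(phrase, count)
```

Note that `is the` also appears twice, so it passes this filter too; dropping such pairs would need an extra step such as a stop-word list. If a pandas object is preferred, the dictionary can be passed to `pd.Series`.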
If you want something more sophisticated, such as lowercasing or stripping punctuation, do it before counting the bigrams. Once that is in place, extending the approach to trigrams and longer n-grams is easy :)
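As a rough illustration of preprocessing before counting, the sketch below lowercases each utterance and drops punctuation with a regular expression. The `tokenize` helper and its pattern are my own assumptions, not part of the code above; adjust the pattern to whatever counts as a token in your data.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

utterances = [
    'Where Arizona state located?',
    'how to find Arizona state',
    'is the united stated is the biggest country around the world?',
    'Arizona state borders',
    'united stated borders',
]

# Preprocess first, then count bigrams on the cleaned tokens.
total = Counter()
for text in utterances:
    tokens = tokenize(text)
    total.update(zip(tokens, tokens[1:]))

print(total.most_common(3))
```

With this tokenizer, `located?` and `world?` lose their question marks, so bigrams such as `('state', 'located')` no longer carry punctuation.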
UPDATE: the same approach extended to trigrams:
import pandas as pd
from collections import Counter
from functools import reduce
df = pd.DataFrame({'utterances':
['Where Arizona state located?', 'how to find Arizona state', 'is the united stated is the biggest country around the world?', 'Arizona state borders', 'united stated borders']
})
# Per-utterance bigram and trigram counts from whitespace-separated tokens.
df['bigrams'] = df['utterances'].apply(lambda item: Counter(zip(item.split(), item.split()[1:])))
df['trigrams'] = df['utterances'].apply(lambda item: Counter(zip(item.split(), item.split()[1:], item.split()[2:])))
# Sum the per-row Counters into overall totals.
total_bigram = reduce(lambda a, b: a + b, df['bigrams'].to_list())
total_trigram = reduce(lambda a, b: a + b, df['trigrams'].to_list())
print(total_bigram.most_common())
print(total_trigram.most_common())
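The bigram and trigram lines differ only in how many shifted copies of the token list are zipped together, so the pattern can be generalized to any n. A minimal sketch, assuming whitespace tokenization as above; `ngrams` and `count_ngrams` are hypothetical helper names, not library functions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator of consecutive n-token tuples from a token list."""
    # Zipping tokens[0:], tokens[1:], ..., tokens[n-1:] stops at the shortest
    # slice, so inputs with fewer than n tokens yield nothing.
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(texts, n):
    """Count n-grams over an iterable of strings, splitting on whitespace."""
    total = Counter()
    for text in texts:
        total.update(ngrams(text.split(), n))
    return total

utterances = [
    'Where Arizona state located?',
    'how to find Arizona state',
    'is the united stated is the biggest country around the world?',
    'Arizona state borders',
    'united stated borders',
]

print(count_ngrams(utterances, 2).most_common(3))
print(count_ngrams(utterances, 3).most_common(3))
```

With a DataFrame, the same helper could be called as `count_ngrams(df['utterances'], 2)`, since iterating over a Series yields its values.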
Upvotes: 2