Reputation: 163
Given a string:
this is a test this is
How can I find the top-n most common 2-grams? In the string above, all 2-grams are:
{this is, is a, a test, test this, this is}
As you can see, the 2-gram this is
appears twice. Hence the result should be:
{this is: 2}
I know I can use the Counter.most_common()
method to find the most common elements, but how can I create the list of 2-grams from the string to begin with?
Upvotes: 6
Views: 11433
Reputation: 33
The simplest way of doing this is:
s = "this is a test this is"
words = s.split()
words_zip = zip(words, words[1:])
two_grams_list = list(words_zip)
print(two_grams_list)
The above code will give you a list of all two-grams:
[('this', 'is'), ('is', 'a'), ('a', 'test'), ('test', 'this'), ('this', 'is')]
Now, we need to count the frequency of each two-gram:
count_freq = {}
for item in two_grams_list:
    if item in count_freq:
        count_freq[item] += 1
    else:
        count_freq[item] = 1
Now, we sort the result in descending order of frequency and print it:
sorted_two_grams = sorted(count_freq.items(), key=lambda item: item[1], reverse=True)
print(sorted_two_grams)
Output:
[(('this', 'is'), 2), (('is', 'a'), 1), (('a', 'test'), 1), (('test', 'this'), 1)]
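If you only need the top-n entries, a slice of the sorted list is enough (a small follow-up sketch; n = 2 is just an illustrative value):
n = 2  # illustrative: how many of the most common 2-grams to keep
print(sorted_two_grams[:n])  # [(('this', 'is'), 2), (('is', 'a'), 1)]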
Upvotes: 1
Reputation: 39023
Well, you can use
words = s.split() # s is the original string
pairs = [(words[i], words[i+1]) for i in range(len(words)-1)]
(words[i], words[i+1])
is the pair of words at positions i and i+1, and we go over all pairs from (0, 1) to (n-2, n-1), where n is the number of words.
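To get the top-n most common 2-grams from there, the pairs list can be fed straight into collections.Counter; a minimal follow-up sketch (n = 1 is just an illustrative value):
from collections import Counter

s = "this is a test this is"
words = s.split()
pairs = [(words[i], words[i+1]) for i in range(len(words)-1)]

n = 1  # illustrative: how many of the most common 2-grams to keep
print(Counter(pairs).most_common(n))  # [(('this', 'is'), 2)]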
Upvotes: 2
Reputation: 6322
You can use the method provided in this blog post to conveniently create n-grams in Python.
from collections import Counter
bigrams = zip(words, words[1:])
counts = Counter(bigrams)
print(counts.most_common())
That assumes that the input is a list of words, of course. If your input is a string like the one you provided (which does not have any punctuation), then you can do just words = text.split(' ')
to get a list of words. In general, though, you would have to take punctuation, whitespace and other non-alphabetic characters into account. In that case you might do something like
import re
words = re.findall(r'[A-Za-z]+', text)
or you could use an external library such as nltk.tokenize.
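For example, a possible sketch using NLTK's word_tokenize (this assumes NLTK is installed; the punkt tokenizer models may need to be downloaded first):
import nltk
# nltk.download('punkt')  # uncomment if the tokenizer models are missing
from nltk.tokenize import word_tokenize

text = "this is a test, this is!"
words = [w for w in word_tokenize(text) if w.isalpha()]  # keep alphabetic tokens only
print(words)  # ['this', 'is', 'a', 'test', 'this', 'is']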
Edit: If you need tri-grams or any other n-grams in general, you can use the function provided in the blog post I linked to:
def find_ngrams(input_list, n):
    return zip(*(input_list[i:] for i in range(n)))

trigrams = find_ngrams(words, 3)
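Counting them works exactly like the bigram case above; a quick usage sketch (note that find_ngrams returns a lazy zip object in Python 3, so wrap it in list() if you want to inspect the n-grams directly):
trigram_counts = Counter(find_ngrams(words, 3))  # Counter imported above
print(trigram_counts.most_common(2))  # the two most frequent trigrams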
Upvotes: 9