Sorting Bigram by number of occurrence NLTK

Question

I am currently running the code below to find bigrams across my entire text.

The variable alltext holds a very long text (over 1 million words).

I ran this code to extract the bigrams:

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# Keep only alphabetic tokens of two or more letters.
tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
tokens = tokenizer.tokenize(alltext)
# Build the stopword set once instead of reloading it for every token.
stopwords_list = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stopwords_list]
finder = BigramCollocationFinder.from_words(tokens, window_size=2)
bigram_measures = nltk.collocations.BigramAssocMeasures()

for k, v in finder.ngram_fd.items():
    print(k, v)

The code above counts how often each possible bigram occurs.

It prints a lot of bigrams along with their number of occurrences.

The output looks similar to this:

(('upper', 'front'), 1)
(('pad', 'Teething'), 1)
(('shoulder', 'strap'), 1)
(('outer', 'breathable'), 1)
(('memory', 'foam'), 1)
(('shields', 'inner'), 1)
(('The', 'garment'), 2)
......

type(finder.ngram_fd.items()) is a list.

How can I sort the bigrams by frequency, from the highest to the lowest number of occurrences? My desired result would be:

(('The', 'garment'), 2)
(('upper', 'front'), 1)
(('pad', 'Teething'), 1)
(('shoulder', 'strap'), 1)
(('outer', 'breathable'), 1)
(('memory', 'foam'), 1)
(('shields', 'inner'), 1)

Thank you very much. I am quite new to NLTK and text processing, so my explanation may not be very clear.

rubik · Accepted Answer

It looks like finder.ngram_fd is a dictionary. In that case, in Python 3 the items() method does not return a list, so you'll have to convert it to one.
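To illustrate the point, here is a minimal sketch that uses a small hypothetical dictionary in place of finder.ngram_fd (the name bigram_counts and the counts are made up for the example). In Python 3, items() returns a view object, which has no sort() method until it is converted to a list:

```python
# Hypothetical stand-in for finder.ngram_fd; the counts are invented.
bigram_counts = {('upper', 'front'): 1, ('The', 'garment'): 2}

items_view = bigram_counts.items()
# In Python 3 this is a dict_items view, not a list.
is_list = isinstance(items_view, list)
has_sort = hasattr(items_view, 'sort')

# Converting to a list gives an object that can be sorted in place.
items_list = list(items_view)
print(is_list, has_sort, hasattr(items_list, 'sort'))
```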

Once you have a list, you can simply use the key= parameter of the sort() method, which specifies what we're sorting against:

ngram = list(finder.ngram_fd.items())
ngram.sort(key=lambda item: item[-1], reverse=True)

You have to add reverse=True because otherwise the results would be in ascending order. Note that this will sort the list in place. This is best when you want to avoid copying. If instead you wish to obtain a new list, just use the sorted() built-in function with the same arguments.
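As a rough sketch of the sorted() variant, again with a hypothetical dictionary standing in for finder.ngram_fd (the name bigram_counts and the counts are invented):

```python
# Hypothetical stand-in for finder.ngram_fd; the counts are invented.
bigram_counts = {
    ('upper', 'front'): 1,
    ('The', 'garment'): 2,
    ('memory', 'foam'): 1,
}

# sorted() returns a new list and leaves the original mapping untouched.
by_frequency = sorted(bigram_counts.items(),
                      key=lambda item: item[-1],
                      reverse=True)

for bigram, count in by_frequency:
    print(bigram, count)
```

Bigrams with equal counts keep their original relative order, because Python's sort is stable, including when reverse=True is used.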

Alternatively, you can replace the lambda with the itemgetter function from the operator module, which does the same thing:

import operator
ngram.sort(key=operator.itemgetter(-1), reverse=True)
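Here is a short sketch comparing the two key functions on made-up counts. It also shows collections.Counter.most_common(), which may be an option on the assumption that finder.ngram_fd is an nltk.FreqDist built on Counter (I believe that is the case in NLTK 3, but check it for your version):

```python
import operator
from collections import Counter

# Hypothetical stand-in for finder.ngram_fd; the counts are invented.
bigram_counts = Counter({
    ('upper', 'front'): 1,
    ('The', 'garment'): 2,
    ('memory', 'foam'): 1,
})

# Both key functions pick the last element of each (bigram, count) pair.
with_lambda = sorted(bigram_counts.items(),
                     key=lambda item: item[-1], reverse=True)
with_itemgetter = sorted(bigram_counts.items(),
                         key=operator.itemgetter(-1), reverse=True)

# Counter can also return the items ordered by count directly.
top = bigram_counts.most_common(1)
print(with_lambda == with_itemgetter, top)
```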
