Reputation: 1416
I am currently running this code to search for bigrams across all of my text.
The variable alltext is a very long text (over 1 million words).
I ran this code to extract the bigrams:
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

# Keep runs of two or more ASCII letters as tokens
tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
tokens = tokenizer.tokenize(alltext)
# Load the stopwords once, as a set, so each membership test is cheap
stopwords_set = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stopwords_set]
finder = BigramCollocationFinder.from_words(tokens, window_size=2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
# Print each (bigram, count) pair
for k, v in finder.ngram_fd.items():
    print((k, v))
The code above counts how often each bigram occurs in the text.
It prints many bigrams along with their number of occurrences.
The output looks similar to this:
(('upper', 'front'), 1)
(('pad', 'Teething'), 1)
(('shoulder', 'strap'), 1)
(('outer', 'breathable'), 1)
(('memory', 'foam'), 1)
(('shields', 'inner'), 1)
(('The', 'garment'), 2)
......
type(finder.ngram_fd.items()) is a list.
How can I sort the bigrams by frequency, from the highest to the lowest number of occurrences? My desired result would be:
(('The', 'garment'), 2)
(('upper', 'front'), 1)
(('pad', 'Teething'), 1)
(('shoulder', 'strap'), 1)
(('outer', 'breathable'), 1)
(('memory', 'foam'), 1)
(('shields', 'inner'), 1)
Thank you very much. I am quite new to nltk and text processing, so my explanation may not be very clear.
Upvotes: 2
Views: 3477
Reputation: 9104
It looks like finder.ngram_fd is a dictionary. In that case, in Python 3 the items() method does not return a list, so you'll have to convert it to one. Once you have a list, you can simply use the key= parameter of the sort() method, which specifies what we're sorting by:
ngram = list(finder.ngram_fd.items())
ngram.sort(key=lambda item: item[-1], reverse=True)
You have to add reverse=True because otherwise the results would be in ascending order. Note that this sorts the list in place, which is best when you want to avoid copying. If you want a new list instead, use the sorted() built-in function with the same arguments.
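As a rough illustration of the sorted() variant, here is a small sketch that runs without nltk. The pairs list is a hand-copied sample of the output shown in the question and stands in for list(finder.ngram_fd.items()); the names pairs and by_count are made up for this example.

```python
# Sample (bigram, count) pairs copied from the question's output.
# This list is a stand-in for list(finder.ngram_fd.items()).
pairs = [
    (('upper', 'front'), 1),
    (('pad', 'Teething'), 1),
    (('The', 'garment'), 2),
    (('memory', 'foam'), 1),
]

# sorted() returns a new list and leaves `pairs` untouched.
# item[-1] is the count, and reverse=True puts the largest count first.
by_count = sorted(pairs, key=lambda item: item[-1], reverse=True)

for bigram, count in by_count:
    print((bigram, count))
```

Python's sort is stable, so bigrams with the same count keep the relative order they had in the input.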
Alternatively, you can replace the lambda with operator.itemgetter, a function from the operator module that does the same thing:
import operator
ngram.sort(key=operator.itemgetter(-1), reverse=True)
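One more option that may be worth trying: as far as I know, in NLTK 3 the FreqDist behind finder.ngram_fd subclasses collections.Counter, so finder.ngram_fd.most_common() should give the same descending order without an explicit sort. That is an assumption about your NLTK version, so check it on your install. The sketch below uses a plain Counter and a made-up token list to show the idea without needing nltk.

```python
from collections import Counter
from operator import itemgetter

# Hypothetical token list, only for illustration.
tokens = ['memory', 'foam', 'pad', 'memory', 'foam', 'cover']

# Count adjacent pairs; this Counter plays the role of finder.ngram_fd.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Explicit sort with itemgetter, as in the answer above.
ranked = sorted(bigram_counts.items(), key=itemgetter(-1), reverse=True)

# Counter.most_common() returns (item, count) pairs, highest count first.
top = bigram_counts.most_common(1)

print(ranked[0])
print(top)
```

If most_common() is available on your finder.ngram_fd, passing a number such as most_common(20) limits the result to the most frequent bigrams, which can be handy with over a million words.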
Upvotes: 4