Merge generator objects to calculate frequency in NLTK

Question

I am trying to count frequency of various ngrams using ngram and freqDist functions in nltk. Due to the fact that the ngram function output is a generator object, I would like to merge the output from each ngram before calculating frequency. However, I am running into problems to merge the various generator objects.

I have tried itertools.chain, which created an itertools object, rather than merge the generators. I have finally settled on permutations, but to parse the objects afterwards seems redundant.

The working code thus far is:

import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
import re
corpus = 'testing sentences to see if if if this works'
token = word_tokenize(corpus)
unigrams = ngrams(token,1)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)


perms = list(permutations([unigrams,bigrams,trigrams]))
fdist = nltk.FreqDist(perms)
for x,y in fdist.items():
    for k in x:
        for v in k:
            words = '_'.join(v)
            print words, y

As you can see in the results, freq dist is not calculating the words from the individual generator objects properly as each has a frequency of 1. Is there a more pythonic way to do properly do this?

alvas · Accepted Answer

Use everygrams, it returns the all n-grams given a range of n.

>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'
>>> everygrams(corpus.split(), 1, 3)

>>> list(everygrams(corpus.split(), 1, 3))
[('testing',), ('sentences',), ('to',), ('see',), ('if',), ('if',), ('if',), ('this',), ('works',), ('testing', 'sentences'), ('sentences', 'to'), ('to', 'see'), ('see', 'if'), ('if', 'if'), ('if', 'if'), ('if', 'this'), ('this', 'works'), ('testing', 'sentences', 'to'), ('sentences', 'to', 'see'), ('to', 'see', 'if'), ('see', 'if', 'if'), ('if', 'if', 'if'), ('if', 'if', 'this'), ('if', 'this', 'works')]

To combine the counting of different orders of ngrams:

>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'.split()
>>> fd = FreqDist(everygrams(corpus, 1, 3))
>>> fd
FreqDist({('if',): 3, ('if', 'if'): 2, ('to', 'see'): 1, ('sentences', 'to', 'see'): 1, ('if', 'this'): 1, ('to', 'see', 'if'): 1, ('works',): 1, ('testing', 'sentences', 'to'): 1, ('sentences', 'to'): 1, ('sentences',): 1, ...})

Alternatively, FreqDist is essentially a collections.Counter sub-class, so you can combine counters as such:

>>> from collections import Counter
>>> x = Counter([1,2,3,4,4,5,5,5])
>>> y = Counter([1,1,1,2,2])
>>> x + y
Counter({1: 4, 2: 3, 5: 3, 4: 2, 3: 1})
>>> x

>>> from nltk import FreqDist
>>> FreqDist(['a', 'a', 'b'])
FreqDist({'a': 2, 'b': 1})
>>> a = FreqDist(['a', 'a', 'b'])
>>> b = FreqDist(['b', 'b', 'c', 'd', 'e'])
>>> a + b
FreqDist({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'd': 1})

Merge generator objects to calculate frequency in NLTK

Answers (2)

Make a list out of several iterators

Use the iterators themselves

Related Questions