owwoow14

Reputation: 1754

Merge generator objects to calculate frequency in NLTK

I am trying to count the frequency of various n-grams using the ngrams and FreqDist functions in NLTK. Because ngrams returns a generator object, I would like to merge the output of each ngrams call before calculating frequencies. However, I am having trouble merging the generator objects.

I have tried itertools.chain, which returned an itertools.chain object rather than merging the generators. I finally settled on permutations, but having to parse the resulting objects afterwards seems redundant.

The working code thus far is:

import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
import re
corpus = 'testing sentences to see if if if this works'
token = word_tokenize(corpus)
unigrams = ngrams(token,1)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)


perms = list(permutations([unigrams,bigrams,trigrams]))
fdist = nltk.FreqDist(perms)
for x,y in fdist.items():
    for k in x:
        for v in k:
            words = '_'.join(v)
            print words, y

As you can see in the results, FreqDist is not counting the words from the individual generator objects properly: each one has a frequency of 1. Is there a more Pythonic way to do this properly?

Upvotes: 3

Views: 3638

Answers (2)

alexis

Reputation: 50220

Alvas is right: nltk.everygrams is the perfect tool for this job. But merging several iterators is neither hard nor uncommon, so you should know how to do it. The key is that any iterator can be converted to a list, but an iterator is used up once you have read it, so it's best to do that only once:

Make a list out of several iterators

  1. Just use lists (simple but inefficient)

    allgrams = list(unigrams) + list(bigrams) + list(trigrams)
    
  2. Or build a single list, properly

    allgrams = list(unigrams)
    allgrams.extend(bigrams)
    allgrams.extend(trigrams)
    
  3. Or use itertools.chain(), then make a list

    allgrams = list(itertools.chain(unigrams, bigrams, trigrams))
    

All three produce identical results, provided each one starts from fresh iterators: unigrams, bigrams and trigrams are exhausted after a single pass, so you need to redefine them between examples.
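
To make the "redefine them between examples" point concrete, here is a small sketch that runs without NLTK. The helper simple_ngrams is a hypothetical stand-in for nltk.util.ngrams (an assumption for illustration, not NLTK's implementation); like the real function, it hands back a one-shot iterator.

```python
from itertools import chain

def simple_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams: lazily yields n-tuples."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = 'testing sentences to see if if if this works'.split()

def fresh():
    # Rebuild all three iterators; each can be consumed only once.
    return [simple_ngrams(tokens, n) for n in (1, 2, 3)]

# 1. Concatenate three lists
uni, bi, tri = fresh()
by_concat = list(uni) + list(bi) + list(tri)

# 2. Build one list and extend it
uni, bi, tri = fresh()
by_extend = list(uni)
by_extend.extend(bi)
by_extend.extend(tri)

# 3. Chain the iterators, then make a list
uni, bi, tri = fresh()
by_chain = list(chain(uni, bi, tri))

print(by_concat == by_extend == by_chain)  # True
print(list(uni))                           # [] (already exhausted)
```

Without the calls to fresh(), the second and third approaches would silently produce empty lists, because the first approach has already drained the iterators.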

Use the iterators themselves

Don't fight iterators, learn to work with them. Many Python functions accept them instead of lists, saving you much space and time.

  1. You could form a single iterator and pass it to nltk.FreqDist():

    fdist = nltk.FreqDist(itertools.chain(unigrams, bigrams, trigrams))
    
  2. You can work with multiple iterators. FreqDist, like Counter, has an update() method you can use to count things incrementally:

    fdist = nltk.FreqDist(unigrams)
    fdist.update(bigrams)
    fdist.update(trigrams)
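
Both iterator-based approaches should give the same counts. The sketch below checks that with collections.Counter standing in for nltk.FreqDist and a hypothetical simple_ngrams helper standing in for nltk.util.ngrams (both substitutions are assumptions made so the example runs without NLTK).

```python
from collections import Counter
from itertools import chain

def simple_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams: lazily yields n-tuples."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = 'testing sentences to see if if if this works'.split()

# Approach 1: one chained iterator, counted in a single pass
chained = Counter(chain(simple_ngrams(tokens, 1),
                        simple_ngrams(tokens, 2),
                        simple_ngrams(tokens, 3)))

# Approach 2: count the unigrams, then add the others incrementally
incremental = Counter(simple_ngrams(tokens, 1))
incremental.update(simple_ngrams(tokens, 2))
incremental.update(simple_ngrams(tokens, 3))

print(chained == incremental)                      # True
print(chained[('if',)], chained[('if', 'if')])     # 3 2
```

Neither version ever builds an intermediate list of all the n-grams, which is the space saving mentioned above.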
    

Upvotes: 2

alvas

Reputation: 122168

Use everygrams; it returns all the n-grams for a given range of n.

>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'
>>> everygrams(corpus.split(), 1, 3)
<generator object everygrams at 0x7f4e272e9730>
>>> list(everygrams(corpus.split(), 1, 3))
[('testing',), ('sentences',), ('to',), ('see',), ('if',), ('if',), ('if',), ('this',), ('works',), ('testing', 'sentences'), ('sentences', 'to'), ('to', 'see'), ('see', 'if'), ('if', 'if'), ('if', 'if'), ('if', 'this'), ('this', 'works'), ('testing', 'sentences', 'to'), ('sentences', 'to', 'see'), ('to', 'see', 'if'), ('see', 'if', 'if'), ('if', 'if', 'if'), ('if', 'if', 'this'), ('if', 'this', 'works')]

To combine the counting of different orders of ngrams:

>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'.split()
>>> fd = FreqDist(everygrams(corpus, 1, 3))
>>> fd
FreqDist({('if',): 3, ('if', 'if'): 2, ('to', 'see'): 1, ('sentences', 'to', 'see'): 1, ('if', 'this'): 1, ('to', 'see', 'if'): 1, ('works',): 1, ('testing', 'sentences', 'to'): 1, ('sentences', 'to'): 1, ('sentences',): 1, ...})
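
To get the underscore-joined output the question asks for, the tuple keys can be joined when printing. This is only a sketch: all_ngrams is a hypothetical pure-Python helper that mimics the idea of everygrams (it is not NLTK's implementation), and collections.Counter stands in for FreqDist so the snippet runs without NLTK.

```python
from collections import Counter

def all_ngrams(tokens, min_len, max_len):
    """Hypothetical helper: yield every n-gram with min_len <= n <= max_len."""
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

tokens = 'testing sentences to see if if if this works'.split()
counts = Counter(all_ngrams(tokens, 1, 3))

# Join each n-gram tuple with '_' and pair it with its frequency.
top_two = ['%s %d' % ('_'.join(gram), freq)
           for gram, freq in counts.most_common(2)]

for line in top_two:
    print(line)
# if 3
# if_if 2
```

With a real FreqDist the same loop should work unchanged, since most_common() is inherited from Counter.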

Alternatively, FreqDist is essentially a collections.Counter sub-class, so you can combine counters as such:

>>> from collections import Counter
>>> x = Counter([1,2,3,4,4,5,5,5])
>>> y = Counter([1,1,1,2,2])
>>> x + y
Counter({1: 4, 2: 3, 5: 3, 4: 2, 3: 1})

>>> from nltk import FreqDist
>>> FreqDist(['a', 'a', 'b'])
FreqDist({'a': 2, 'b': 1})
>>> a = FreqDist(['a', 'a', 'b'])
>>> b = FreqDist(['b', 'b', 'c', 'd', 'e'])
>>> a + b
FreqDist({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'd': 1})
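
Applied to the n-gram case from the question, adding counters looks roughly like the sketch below. It uses collections.Counter in place of FreqDist and a hypothetical simple_ngrams helper in place of nltk.util.ngrams, both assumptions made so that it runs without NLTK.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams: lazily yields n-tuples."""
    return zip(*(tokens[i:] for i in range(n)))

tokens = 'testing sentences to see if if if this works'.split()

# One counter per n-gram order: unigrams, bigrams, trigrams
per_order = [Counter(simple_ngrams(tokens, n)) for n in (1, 2, 3)]

# "+" returns a new counter and leaves its operands untouched.
combined = per_order[0] + per_order[1] + per_order[2]

print(combined[('if',)], combined[('if', 'if')])   # 3 2
print(sum(per_order[0].values()))                  # 9 (unigram counter unchanged)
print(sum(combined.values()))                      # 24
```

Keeping the per-order counters around is handy if you also want frequencies for unigrams, bigrams or trigrams on their own.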

Upvotes: 7
