Reputation: 1754
I am trying to count the frequency of various ngrams using the ngrams and FreqDist functions in nltk.
Because the ngrams function returns a generator object, I would like to merge the output from each ngrams call before calculating frequencies.
However, I am running into problems merging the various generator objects.
I have tried itertools.chain, which created an itertools object rather than merging the generators.
I have finally settled on permutations, but having to parse the objects afterwards seems redundant.
The working code thus far is:
import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
import re

corpus = 'testing sentences to see if if if this works'
token = word_tokenize(corpus)
unigrams = ngrams(token,1)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
perms = list(permutations([unigrams,bigrams,trigrams]))
fdist = nltk.FreqDist(perms)
for x,y in fdist.items():
    for k in x:
        for v in k:
            words = '_'.join(v)
            print words, y
As you can see in the results, FreqDist is not counting the words from the individual generator objects properly, as each has a frequency of 1. Is there a more Pythonic way to do this properly?
Upvotes: 3
Views: 3638
Reputation: 50220
Alvas is right, nltk.everygrams is the perfect tool for this job. But merging several iterators is really not that hard, nor that uncommon, so you should know how to do it. The key is that any iterator can be converted to a list, but it's best to do that only once:
Just use lists (simple but inefficient)
allgrams = list(unigrams) + list(bigrams) + list(trigrams)
Or build a single list, properly
allgrams = list(unigrams)
allgrams.extend(bigrams)
allgrams.extend(trigrams)
Or use itertools.chain(), then make a list
allgrams = list(itertools.chain(unigrams, bigrams, trigrams))
The above produce identical results (as long as you don't try to reuse the iterators unigrams etc.; each one is exhausted after a single pass, so you need to redefine them between examples).
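The caveat about reuse is easy to trip over, so here is a minimal sketch of it using only the standard library. Note that simple_ngrams is a hypothetical stand-in for nltk.util.ngrams (it is not part of NLTK); the only point is that a generator can be consumed once.

```python
def simple_ngrams(tokens, n):
    """Lazily yield n-grams as tuples.

    Hypothetical stand-in for nltk.util.ngrams, so the sketch has no
    third-party dependency.
    """
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = 'testing sentences to see if if if this works'.split()

unigrams = simple_ngrams(tokens, 1)
first_pass = list(unigrams)   # consumes the generator
second_pass = list(unigrams)  # nothing left to yield

print(len(first_pass), len(second_pass))

# To reuse the data, either keep the list or build a fresh generator.
unigrams = simple_ngrams(tokens, 1)
```

If you need the n-grams more than once, keeping first_pass around is usually simpler than recreating the generator each time.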
Don't fight iterators, learn to work with them. Many Python functions accept them instead of lists, saving you much space and time.
You could form a single iterator and pass it to nltk.FreqDist():
fdist = nltk.FreqDist(itertools.chain(unigrams, bigrams, trigrams))
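As a rough illustration of why the chain object is nothing to worry about: it is itself a lazy iterator over all the n-grams, so anything that counts an iterable can consume it directly. The sketch below assumes collections.Counter as a stand-in for FreqDist and a hypothetical simple_ngrams helper in place of nltk.util.ngrams, and prints the results in the underscore-joined form used in the question.

```python
from collections import Counter
from itertools import chain


def simple_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams (not part of NLTK)."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = 'testing sentences to see if if if this works'.split()

# chain() returns one lazy iterator that walks each generator in turn.
merged = chain(
    simple_ngrams(tokens, 1),
    simple_ngrams(tokens, 2),
    simple_ngrams(tokens, 3),
)

# Counter stands in for nltk.FreqDist here; both count hashable items.
counts = Counter(merged)

for gram, freq in counts.most_common(3):
    print('_'.join(gram), freq)
```

No intermediate list is built; the n-grams are counted as they are produced.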
You can also work with multiple iterators. FreqDist, like Counter, has an update() method you can use to count things incrementally:
fdist = nltk.FreqDist(unigrams)
fdist.update(bigrams)
fdist.update(trigrams)
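The incremental approach also works in a loop, which helps when the range of n is not fixed in advance. This is only a sketch: it uses collections.Counter rather than FreqDist, and simple_ngrams is a hypothetical helper standing in for nltk.util.ngrams.

```python
from collections import Counter


def simple_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams (not part of NLTK)."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = 'testing sentences to see if if if this works'.split()

counts = Counter()
for n in range(1, 4):
    # update() consumes each generator and adds to the running totals.
    counts.update(simple_ngrams(tokens, n))

print(counts[('if',)], counts[('if', 'if')])
```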
Upvotes: 2
Reputation: 122168
Use everygrams; it returns all n-grams for a given range of n.
>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'
>>> everygrams(corpus.split(), 1, 3)
<generator object everygrams at 0x7f4e272e9730>
>>> list(everygrams(corpus.split(), 1, 3))
[('testing',), ('sentences',), ('to',), ('see',), ('if',), ('if',), ('if',), ('this',), ('works',), ('testing', 'sentences'), ('sentences', 'to'), ('to', 'see'), ('see', 'if'), ('if', 'if'), ('if', 'if'), ('if', 'this'), ('this', 'works'), ('testing', 'sentences', 'to'), ('sentences', 'to', 'see'), ('to', 'see', 'if'), ('see', 'if', 'if'), ('if', 'if', 'if'), ('if', 'if', 'this'), ('if', 'this', 'works')]
To combine the counting of different orders of ngrams:
>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'.split()
>>> fd = FreqDist(everygrams(corpus, 1, 3))
>>> fd
FreqDist({('if',): 3, ('if', 'if'): 2, ('to', 'see'): 1, ('sentences', 'to', 'see'): 1, ('if', 'this'): 1, ('to', 'see', 'if'): 1, ('works',): 1, ('testing', 'sentences', 'to'): 1, ('sentences', 'to'): 1, ('sentences',): 1, ...})
Alternatively, FreqDist is essentially a collections.Counter subclass, so you can combine counters like so:
>>> from collections import Counter
>>> x = Counter([1,2,3,4,4,5,5,5])
>>> y = Counter([1,1,1,2,2])
>>> x + y
Counter({1: 4, 2: 3, 5: 3, 4: 2, 3: 1})
>>> from nltk import FreqDist
>>> FreqDist(['a', 'a', 'b'])
FreqDist({'a': 2, 'b': 1})
>>> a = FreqDist(['a', 'a', 'b'])
>>> b = FreqDist(['b', 'b', 'c', 'd', 'e'])
>>> a + b
FreqDist({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'd': 1})
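One detail worth knowing when combining counters: + builds a new object and leaves both operands untouched, whereas update() modifies the counter in place. The sketch below shows this with collections.Counter only; since FreqDist is described above as a Counter subclass, it presumably behaves the same way, but that is an assumption not checked here.

```python
from collections import Counter

a = Counter(['a', 'a', 'b'])
b = Counter(['b', 'b', 'c', 'd', 'e'])

total = a + b          # new Counter; a and b are unchanged
a_before = dict(a)     # snapshot of a before the in-place update

a.update(b)            # adds b's counts into a itself

print(total == a)
```

Use + when you still need the separate distributions afterwards, and update() when you only care about the running total.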
Upvotes: 7