subzero

Reputation: 27

Generate unigrams and bigrams from a trigram list

I am looking at ways of storing only the trigram frequencies in memory and calculating the unigram and bigram frequencies on the fly, as follows:

Given a trigram (u, v, w):

count(v, w) = sum of count(u, v, w) over all u

Similarly, count(w) = sum of count(v, w) over all v

This does result in a few missing unigrams, for example the sentence-begin marker, but does this sound like a valid approach to generating unigrams and bigrams?

Upvotes: 1

Views: 1089

Answers (1)

dhg

Reputation: 52681

Yes. That will work. You can check it by making yourself a tiny corpus and manually doing the counting to ensure that it comes out the same.

from collections import Counter

corpus = [['the','dog','walks'], ['the','dog','runs'], ['the','cat','runs']]
# Pad each sentence with two start markers and one end marker
corpus_with_ends = [['<s>','<s>'] + s + ['<e>'] for s in corpus]

# Count every trigram in the padded corpus
trigram_counts = Counter(trigram for s in corpus_with_ends for trigram in zip(s,s[1:],s[2:]))

# Bigram counts: sum the trigram counts over the first word
unique_bigrams = set((b,c) for a,b,c in trigram_counts)
bigram_counts = dict((bigram,sum(count for trigram,count in trigram_counts.items() if trigram[1:] == bigram)) for bigram in unique_bigrams)

# Unigram counts: sum the trigram counts over the first two words (end marker excluded)
unique_unigrams = set((c,) for a,b,c in trigram_counts if c != '<e>')
unigram_counts = dict((unigram,sum(count for trigram,count in trigram_counts.items() if trigram[2:] == unigram)) for unigram in unique_unigrams)

Now you can check things:

>>> true_bigrams = [bigram for s in corpus_with_ends for bigram in zip(s[1:],s[2:])]
>>> true_bigram_counts = Counter(true_bigrams)
>>> bigram_counts == true_bigram_counts
True

>>> true_unigrams = [(unigram,) for s in corpus_with_ends for unigram in s[2:-1]]
>>> true_unigram_counts = Counter(true_unigrams)
>>> unigram_counts == true_unigram_counts
True
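
The code above rescans every trigram for each unique bigram and unigram, which gets slow on a large table. A minimal sketch of the same marginalization done in a single pass follows. The names `padded`, `bigrams`, `unigrams` and `sentence_starts` are made up for this illustration, and it assumes the same padding as above (two `<s>` markers and one `<e>` per sentence). Under that assumption each sentence contributes exactly one trigram starting with `('<s>', '<s>')`, so the count of the sentence-begin marker that the question worries about can be recovered as well.

```python
from collections import Counter

corpus = [['the', 'dog', 'walks'], ['the', 'dog', 'runs'], ['the', 'cat', 'runs']]
# Assumed padding: two start markers and one end marker per sentence.
padded = [['<s>', '<s>'] + s + ['<e>'] for s in corpus]
trigram_counts = Counter(t for s in padded for t in zip(s, s[1:], s[2:]))

bigrams = Counter()
unigrams = Counter()
sentence_starts = 0

# One pass over the trigram table, accumulating the marginal counts.
for (u, v, w), n in trigram_counts.items():
    bigrams[(v, w)] += n            # count(v, w): sum over the first word u
    if w != '<e>':
        unigrams[(w,)] += n         # count(w): sum over the first two words
    if (u, v) == ('<s>', '<s>'):
        sentence_starts += n        # one such trigram per sentence

print(bigrams[('the', 'dog')])   # 2
print(unigrams[('runs',)])       # 2
print(sentence_starts)           # 3
```

Using `Counter` for the accumulators means missing keys start at zero, so no membership checks are needed inside the loop.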

Upvotes: 3
