Generate unigrams and bigrams from a trigram list

Question

I am looking at ways of storing only the trigram frequencies in memory and calculating the unigram and bigram frequencies on the fly, as follows:

Given a trigram (u, v, w):

count(v, w) = sum of count(u, v, w) over all u

Similarly, count(w) = sum of count(v, w) over all v

This does leave out a few unigrams, for example the sentence-begin marker <s>, but does it sound like a valid approach to generating unigram and bigram counts?

dhg · Accepted Answer

Yes. That will work. You can check it by making yourself a tiny corpus and manually doing the counting to ensure that it comes out the same.

from collections import Counter

corpus = [['the','dog','walks'], ['the','dog','runs'], ['the','cat','runs']]
# Pad each sentence with two start markers and one end marker
corpus_with_ends = [['<s>','<s>'] + s + ['</s>'] for s in corpus]

trigram_counts = Counter(trigram for s in corpus_with_ends for trigram in zip(s,s[1:],s[2:]))

# count(v, w): sum the counts of all trigrams that end in (v, w)
unique_bigrams = set((b,c) for a,b,c in trigram_counts)
bigram_counts = dict((bigram,sum(count for trigram,count in trigram_counts.items() if trigram[1:] == bigram)) for bigram in unique_bigrams)

# count(w): sum the counts of all trigrams that end in w, skipping the end marker
unique_unigrams = set((c,) for a,b,c in trigram_counts if c != '</s>')
unigram_counts = dict((unigram,sum(count for trigram,count in trigram_counts.items() if trigram[2:] == unigram)) for unigram in unique_unigrams)

Now you can check things:

>>> true_bigrams = [bigram for s in corpus_with_ends for bigram in zip(s[1:],s[2:])]
>>> true_bigram_counts = Counter(true_bigrams)
>>> bigram_counts == true_bigram_counts
True
>>> true_unigrams = [(unigram,) for s in corpus_with_ends for unigram in s[2:-1]]
>>> true_unigram_counts = Counter(true_unigrams)
>>> unigram_counts == true_unigram_counts
True
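
If the trigram table is large, rescanning it once per bigram gets slow. The same sums can be accumulated in a single pass over the trigram counts. This is only a sketch: the function name `marginalize` and the marker strings '<s>' and '</s>' are assumptions made for illustration, not part of the code above.

```python
from collections import Counter

def marginalize(trigram_counts):
    """Derive bigram and unigram counts from trigram counts in one pass."""
    bigram_counts = Counter()
    unigram_counts = Counter()
    for (u, v, w), count in trigram_counts.items():
        bigram_counts[(v, w)] += count   # count(v, w) = sum over all u
        unigram_counts[(w,)] += count    # count(w) = sum over all u and v
    return bigram_counts, unigram_counts

corpus = [['the', 'dog', 'walks'], ['the', 'dog', 'runs'], ['the', 'cat', 'runs']]
# Assumed padding: two start markers and one end marker per sentence
padded = [['<s>', '<s>'] + s + ['</s>'] for s in corpus]
trigram_counts = Counter(t for s in padded for t in zip(s, s[1:], s[2:]))

bigram_counts, unigram_counts = marginalize(trigram_counts)
print(bigram_counts[('the', 'dog')])   # 2
print(unigram_counts[('runs',)])       # 2
```

Unlike the version above, this sketch keeps the end marker as a unigram. The start marker still never appears, because it is never the last element of a trigram.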
