Reputation: 27
I am looking at storing only the trigram frequencies in memory and computing the unigram and bigram frequencies on the fly, in the following way:
Given a trigram u, v, w:
count(v, w) = sum of count(u, v, w) over all u
Similarly, count(w) = sum of count(v, w) over all v
This does leave a few unigrams missing, for example the sentence-begin marker, but does it sound like a valid approach to generating unigram and bigram counts?
Upvotes: 1
Views: 1089
Reputation: 52681
Yes. That will work. You can check it by making yourself a tiny corpus and manually doing the counting to ensure that it comes out the same.
from collections import Counter
corpus = [['the','dog','walks'], ['the','dog','runs'], ['the','cat','runs']]
# Pad each sentence with two start markers and one end marker.
corpus_with_ends = [['<s>','<s>'] + s + ['<e>'] for s in corpus]
# Count every trigram in the padded corpus.
trigram_counts = Counter(trigram for s in corpus_with_ends for trigram in zip(s,s[1:],s[2:]))
# Bigram counts: for each (v, w), sum the counts of all trigrams (u, v, w).
unique_bigrams = set((b,c) for a,b,c in trigram_counts)
bigram_counts = dict((bigram,sum(count for trigram,count in trigram_counts.items() if trigram[1:] == bigram)) for bigram in unique_bigrams)
# Unigram counts: for each w, sum the counts of all trigrams ending in w (the end marker is skipped).
unique_unigrams = set((c,) for a,b,c in trigram_counts if c != '<e>')
unigram_counts = dict((unigram,sum(count for trigram,count in trigram_counts.items() if trigram[2:] == unigram)) for unigram in unique_unigrams)
Now you can check things:
>>> true_bigrams = [bigram for s in corpus_with_ends for bigram in zip(s[1:],s[2:])]
>>> true_bigram_counts = Counter(true_bigrams)
>>> bigram_counts == true_bigram_counts
True
>>> true_unigrams = [(unigram,) for s in corpus_with_ends for unigram in s[2:-1]]
>>> true_unigram_counts = Counter(true_unigrams)
>>> unigram_counts == true_unigram_counts
True
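The code above rescans every trigram once per bigram and once per unigram, which gets slow on a real corpus. As a rough sketch (the names `marginalize` and `padded` are mine, not from the question), the same sums can be accumulated in a single pass over the trigram counts. Unlike the check above, this version also counts the end marker, and it shows which start-marker entries cannot be recovered from trigrams alone:

```python
from collections import Counter

def marginalize(trigram_counts):
    """Derive bigram and unigram counts from trigram counts in one pass."""
    bigram_counts = Counter()
    unigram_counts = Counter()
    for (u, v, w), n in trigram_counts.items():
        bigram_counts[(v, w)] += n   # count(v, w) = sum over u of count(u, v, w)
        unigram_counts[(w,)] += n    # count(w) = sum over u, v of count(u, v, w)
    return bigram_counts, unigram_counts

# Same toy corpus and padding as above.
corpus = [['the', 'dog', 'walks'], ['the', 'dog', 'runs'], ['the', 'cat', 'runs']]
padded = [['<s>', '<s>'] + s + ['<e>'] for s in corpus]
trigram_counts = Counter(t for s in padded for t in zip(s, s[1:], s[2:]))

bigram_counts, unigram_counts = marginalize(trigram_counts)

print(bigram_counts[('the', 'dog')])        # 2
print(unigram_counts[('runs',)])            # 2
# No trigram ends in '<s>', so these entries are never produced:
print(('<s>', '<s>') in bigram_counts)      # False
print(('<s>',) in unigram_counts)           # False
```

If you need counts for the start marker, one option is to store the number of sentences separately, since each sentence contributes a fixed number of start markers.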
Upvotes: 3