Reputation: 1631
I want to build a word cloud containing multi-word structures (not just single words). In any given text, unigrams will have higher frequencies than bigrams; in general, the frequency of an n-gram decreases as n increases for the same text.
I want to find a magic number or a method to obtain comparable results across unigrams, bigrams, trigrams, and higher n-grams.
Is there any magic number that serves as a multiplier for n-gram frequencies, making them comparable with unigram frequencies?
A solution I have in mind now is to rank each category of n-grams (1, 2, 3, ...) separately and take the first z positions from each category.
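The top-z-per-category idea above can be sketched as follows. This is a minimal illustration using only the standard library; the tokenization (lowercased whitespace split) and the function names are my own assumptions, not an established API:

```python
from collections import Counter

def ngrams(tokens, n):
    """Generate the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_z_per_order(text, max_n=3, z=5):
    """Rank each n-gram order separately and keep the top z of each,
    so unigrams never drown out bigrams or trigrams."""
    tokens = text.lower().split()  # naive tokenization for illustration
    result = {}
    for n in range(1, max_n + 1):
        counts = Counter(ngrams(tokens, n))
        result[n] = counts.most_common(z)
    return result
```

The word cloud would then draw the union of these per-order top lists, sidestepping the need for a cross-order frequency scale entirely.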
Upvotes: 0
Views: 852
Reputation: 77857
As you've posed this, there is no simple linear multiplier. You can make a rough estimate from the size of your set of units. Consider the English alphabet of 26 letters: there are 26 possible unigrams, 26^2 digrams, 26^3 trigrams, and so on. This simple treatment suggests multiplying a digram's frequency by 26 to compare it with unigrams; trigram frequencies would get a 26^2 boost.
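That scaling rule amounts to a one-liner. A minimal sketch, assuming letter n-grams over an alphabet of 26 symbols (the function name and default are illustrative):

```python
def scaled_frequency(freq, n, alphabet_size=26):
    """Scale a raw n-gram frequency by alphabet_size^(n - 1) so it is
    nominally on the same footing as a unigram frequency.
    For n=1 the multiplier is 1; for n=2 it is 26; for n=3, 26^2."""
    return freq * alphabet_size ** (n - 1)
```

For word n-grams you would substitute the vocabulary size for `alphabet_size`, though as noted below the real distribution does not follow this idealized combinatorial scaling.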
I don't know whether that achieves the comparison you want, as the actual distribution of n-grams does not follow any mathematically tractable function. For instance, letter-trigram distribution is a good way to identify the language in use: English, French, Spanish, German, Romanian, etc. have readily distinguishable distributions.
Another possibility is to normalize the data: convert each value into a z-score, the number of standard deviations above or below the mean of the distribution. The resulting list of z-scores has a mean of 0 and a standard deviation of 1.0.
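A minimal sketch of that normalization with the standard library, computing z-scores within one n-gram order so the lists become comparable across orders (the population standard deviation is an assumption; you could use the sample version instead):

```python
import statistics

def z_scores(values):
    """Map raw frequencies to z-scores: (x - mean) / stdev.
    The output list has mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]
```

You would normalize each n-gram order separately, then size the word-cloud entries by their z-scores rather than their raw counts.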
Does either of those get you the results you need?
Upvotes: 1