Ryan Wei

Reputation: 11

Are there good ways to reduce the size of a vocabulary in natural language processing?

While working on tasks like text classification and QA, the original vocabulary generated from the corpus is usually too large and contains a lot of 'unimportant' words. The most popular ways I've seen to reduce the vocabulary size are discarding stop words and words with low frequencies.

For example, in gensim

gensim.utils.prune_vocab(vocab, min_reduce, trim_rule=None):
    Remove all entries from the vocab dictionary with count smaller than min_reduce.
    Modifies vocab in place, returns the sum of all counts that were pruned.
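For instance, on a toy dictionary of made-up counts this would behave roughly like:

    from gensim.utils import prune_vocab

    # made-up counts; real ones would come from scanning the corpus
    vocab = {'the': 1200, 'game': 40, 'basketball': 7, 'xylophone': 1}
    pruned = prune_vocab(vocab, min_reduce=5)  # drops entries with count < 5
    print(vocab)   # {'the': 1200, 'game': 40, 'basketball': 7}
    print(pruned)  # 1, the sum of the discarded counts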

But in practice, setting the minimum count is empirical and does not seem very precise. I notice that the term frequency of each word in the vocabulary often follows a long-tail distribution. Is it a good idea to keep only the top-K words that account for X% (95%, 90%, 85%, ...) of the total term frequency? Or are there any other sensible ways to reduce the vocabulary without seriously influencing the NLP task?
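Concretely, the cutoff I have in mind would be something like this rough sketch (plain Python, toy counts):

    from collections import Counter

    def top_k_for_coverage(counts, coverage=0.95):
        """Keep the most frequent words that together cover `coverage`
        of all token occurrences (cutting off the long tail)."""
        total = sum(counts.values())
        kept, cumulative = set(), 0
        for word, count in counts.most_common():
            kept.add(word)
            cumulative += count
            if cumulative / total >= coverage:
                break
        return kept

    # toy counts; in practice these come from the tokenized corpus
    counts = Counter('the cat sat on the mat the cat sat'.split())
    print(top_k_for_coverage(counts, coverage=0.8))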

Upvotes: 1

Views: 3693

Answers (3)

eliangius

Reputation: 417

You can significantly reduce vocabulary size via text pre-processing tailored to your learning task & domain. Some NLP techniques include (a rough sketch combining a few of these follows the list):

  • Remove rare words & frequent stop words, not just from pre-defined lists but via learned frequency thresholds, TF-IDF weights or removal of superfluous parts of speech.
  • Correct spelling/grammar/slang if your text is noisy or comes from different dialects of the same language.
  • Lemmatize words to remove tense & plurality variants if these relationships don't matter, e.g. played, playing or plays -> play.
  • Parametrize text with named entities whenever specific details aren't needed, e.g. <PERSON> bought <MONEY> tickets to <LOCATION> for <DATE>.
  • Disambiguate & substitute synonyms with the most frequent interpretation, e.g. bedrooms are spacious -> rooms are big.
  • Simplify contractions & negations, e.g. I don't dislike it -> I do not dislike it ~> I like it.
  • Resolve co-references so that pronouns are made explicit, e.g. John said he will go -> John said John will go.
  • Reduce dimensionality with SVD to automatically capture equivalent phrases.
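A rough sketch of a few of these steps using spaCy (assuming the small English model en_core_web_sm is installed; the entity placeholders and which steps you apply are up to your task):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

    def normalize(text):
        doc = nlp(text)
        out = []
        for token in doc:
            if token.is_stop or token.is_punct:
                continue                            # drop stop words & punctuation
            if token.ent_type_:
                out.append(f"<{token.ent_type_}>")  # parametrize named entities
            else:
                out.append(token.lemma_.lower())    # lemmatize everything else
        return out

    print(normalize("John bought 500 dollars of tickets to Paris on Friday."))
    # roughly: ['<PERSON>', 'buy', '<MONEY>', '<MONEY>', 'ticket', '<GPE>', '<DATE>']
    # (exact entity labels and lemmas depend on the model)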

Upvotes: 1

gojomo

Reputation: 54223

In general, the least-frequent words in your training data are also the safest to discard.

This is especially the case for 'word2vec' and similar algorithms. There may not be enough varied examples of the usage of each rare word to learn reliable representations – as opposed to weak/idiosyncratic representations based on the few not-necessarily-representative examples of their use that you do have.

Also, rare words won't recur as often in future texts, which makes them less valuable to the model.

And, by the typical 'zipfian' distribution of word-frequencies in natural-language material, while each individual rare word only occurs a few times, altogether there are many such words. So just discarding words with one to a few instances will often significantly shrink the vocabulary (and thus overall model) by half or more.
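You can check how strong this effect is in your own corpus with a few lines of counting (plain Python sketch; the tiny tokenized_docs list is just a stand-in for your real tokenized corpus):

    from collections import Counter

    # stand-in for your tokenized corpus: a list of token lists
    tokenized_docs = [["the", "cat", "sat", "on", "the", "mat"],
                      ["a", "dog", "chased", "the", "cat"]]

    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    full_size = len(counts)
    for min_count in (2, 3, 5):
        kept = sum(1 for c in counts.values() if c >= min_count)
        print(f"min_count={min_count}: keeps {kept}/{full_size} words "
              f"({kept / full_size:.0%} of the vocabulary)")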

Finally, it's been observed in 'word2vec' that discarding those intervening rare words – which are many in total number, though each individually has only limited-quality examples – often improves the quality of the surviving more-frequent word-vectors. Those more-important words have fewer intervening lower-value 'noisy' words moving them out of each other's context windows, or pulling the model's weights in other directions via interleaved training examples.

(Similarly, in adequate corpuses, using more-aggressive frequent-word downsampling, as controlled by the sample parameter, can often increase word-vector quality while also speeding training – though with no savings in overall vocabulary size, as no words are totally eliminated by that setting.)

On the other hand, 'stop words' are insufficiently numerous to offer much vocabulary-size savings when discarded. Discard them, or not, based on whether their presence helps or hurts your later steps & final results – not to save a tiny amount of vocabulary-driven model space.

Note that for gensim's Word2Vec model, and related algorithms, in addition to the min_count parameter which discards all words appearing fewer times than that value, there is also the max_final_vocab parameter, which will dynamically choose whatever min_count is sufficient to achieve a final vocabulary size no larger than the max_final_vocab value.

So if you know you have the system memory to support a 1-million-word model, you don't have to use trial-and-error on min_count values to reach something near that: you can just specify max_final_vocab=1000000, min_count=1.
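For example, a minimal sketch (the tiny sentences list is just a placeholder for your real corpus):

    from gensim.models import Word2Vec

    # placeholder corpus; pass your real iterable of token lists instead
    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "chased", "the", "cat"]]

    model = Word2Vec(
        sentences,
        min_count=1,              # no fixed frequency floor up front...
        max_final_vocab=1000000,  # ...gensim raises the effective min_count to stay under this cap
        sample=1e-3,              # default; smaller values downsample frequent words more aggressively
    )
    print(len(model.wv))          # vocabulary size actually retained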

(On the other hand, be careful with the max_vocab_size parameter. It should only be used to prevent the initial word-count survey from outgrowing available RAM, and thus should be set to the largest value your system can manage – far, far larger than whatever you'd like your actual final vocabulary size to be. That's because max_vocab_size is enforced whenever the survey-in-progress reaches that size – not just at the end – discarding a lot of the smaller word counts and raising the pruning floor each time it is triggered. If this limit is hit at all, final counts will only be approximate – and the escalating floor means the running vocabulary can sometimes be pruned to a mere 10% or so of the full max_vocab_size.)

Upvotes: 1

dennlinger

Reputation: 11490

There are indeed a few recent developments that try to counteract this problem. The most notable ones are probably subword units (also known as Byte Pair Encodings, or BPEs), which you can imagine as a notion similar to syllables in a word (but not the same!). A word like basketball could then be transformed into variations like bas @@ket @@ball or basket @@ball. Note that this is a constructed example and might not reflect the actually chosen subwords.

The idea itself is relatively old (an article from 1994), but was recently popularized by Sennrich et al., and is used in basically every state-of-the-art NLP library that has to deal with large vocabularies.

The two biggest implementations of this idea are probably fastBPE and Google's SentencePiece.

With subword units, you now basically have the freedom to determine a fixed vocabulary size, and the algorithm will then try to optimize towards a mix of word diversity and splitting "more complex words" into several pieces, such that your desired vocabulary size can cover any word in the corpus. For the exact algorithm, though, I highly recommend looking into the linked paper or implementations.
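For illustration, a minimal SentencePiece sketch (the corpus file name and the vocabulary size of 8000 are placeholder choices):

    import sentencepiece as spm

    # train a BPE model with a fixed vocabulary size on a raw text file;
    # "corpus.txt" and vocab_size=8000 are placeholder choices
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix="subword",
        vocab_size=8000,
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="subword.model")
    print(sp.encode("basketball is fun", out_type=str))
    # e.g. ['▁basket', 'ball', '▁is', '▁fun'] (actual splits depend on the training corpus)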

Upvotes: 1
