Diogo Silva

Reputation: 330

What's the correct implementation of "bag of n-grams"?

I'm reading François Chollet's book "Deep Learning with Python", and on page 204 it suggests that the phrase "The cat sat on the mat." would yield the following 2-grams:

{"The", "The cat", "cat", "cat sat", "sat",
"sat on", "on", "on the", "the", "the mat", "mat"}

Source: François Chollet's book, page 204.

However, every implementation of n-grams that I have seen (nltk, tensorflow) encodes the same phrase as follows:

[('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat.')]

Am I missing some detail? (I'm new to natural language processing, so that might be the case)

Or is the book's implementation wrong/outdated?

Upvotes: 0

Views: 277

Answers (2)

dennlinger

Reputation: 11460

I want to slightly expand on the other answer given, specifically to the "clearly wrong". While I agree that it is not the standard approach (to my knowledge!), there is an important definition in the mentioned book, just before the shown excerpt, which states:

Word n-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept may also be applied to characters instead of words.

(The emphasis on "or fewer" is mine.) It seems that Chollet defines n-grams slightly differently from the common interpretation, namely that an n-gram has to consist of exactly n words, characters, etc. With that definition, the subsequent example is entirely consistent, although you will likely find varying implementations of this in the real world.
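To make the difference between the two readings concrete, here is a minimal pure-Python sketch. The helper `word_ngrams` and its `include_shorter` flag are hypothetical names of my own, not taken from the book or from any library, and the tokenization is a plain whitespace split on the sentence without its trailing period.

```python
def word_ngrams(tokens, n, include_shorter=False):
    """Return word n-grams as space-joined strings.

    include_shorter=False: strict reading, exactly n words per gram.
    include_shorter=True: the "N (or fewer)" reading quoted above.
    (Hypothetical helper, for illustration only.)
    """
    sizes = range(1, n + 1) if include_shorter else [n]
    return [
        " ".join(tokens[i:i + size])
        for size in sizes
        for i in range(len(tokens) - size + 1)
    ]


tokens = "The cat sat on the mat".split()

# Strict reading: only the five bigrams, in order of appearance.
strict_bigrams = word_ngrams(tokens, 2)

# "N or fewer" reading: unigrams and bigrams together, as an unordered bag.
book_bag = set(word_ngrams(tokens, 2, include_shorter=True))

print(strict_bigrams)
print(sorted(book_bag))
```

Under these assumptions, the strict list corresponds to the NLTK-style output in the question, while the set contains the same eleven entries as the book's example.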
One example aside from the mentioned TensorFlow/NLTK implementations is scikit-learn's TfidfVectorizer, which has the parameter ngram_range. This sits somewhere between Chollet's definition and a strict interpretation: you select a minimum and maximum n, and the vectorizer extracts every n-gram in that range, so a single bag can contain both unigrams and bigrams, as in the book's example.

Upvotes: 2

Adnan S

Reputation: 1882

The book's implementation is incorrect. It mixes unigrams (1-grams) with bigrams (2-grams).

Upvotes: 0
