Reputation: 330
I'm reading François Chollet's book "Deep Learning with Python", and on page 204 it suggests that the phrase The cat sat on the mat.
would produce the following 2-grams:
{"The", "The cat", "cat", "cat sat", "sat",
"sat on", "on", "on the", "the", "the mat", "mat"}
However, every implementation of n-grams that I have seen (NLTK, TensorFlow) encodes the same phrase like the following:
[('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat.')]
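(For reference, that output can be reproduced with NLTK's ngrams helper, assuming nltk is installed:)

from nltk import ngrams

tokens = "The cat sat on the mat.".split()
print(list(ngrams(tokens, 2)))
# [('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat.')]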
Am I missing some detail? (I'm new to natural language processing, so that might be the case.)
Or is the book's implementation wrong/outdated?
Upvotes: 0
Views: 277
Reputation: 11460
I want to expand slightly on the other answer, specifically on the "clearly wrong" part. While I agree that this is not the standard approach (to my knowledge!), there is an important definition in the book, just before the quoted example, which states:
Word n-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept may also be applied to characters instead of words.
(the emphasis on "or fewer" is mine). It seems that Chollet defines n-grams slightly differently from the common interpretation (namely, that an n-gram has to consist of exactly n words/characters). With that, the subsequent example is entirely consistent with the given definition, although you will likely find varying implementations of this in the real world.
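Read that way, the book's set can be reproduced with a small sketch (a hypothetical helper, with tokenization simplified to whitespace splitting and the period dropped):

def ngrams_up_to(tokens, n):
    # Collect all groups of n or fewer consecutive tokens,
    # following Chollet's wording of "N (or fewer)".
    grams = set()
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            grams.add(" ".join(tokens[i:i + size]))
    return grams

print(ngrams_up_to("The cat sat on the mat".split(), 2))
# {'The', 'The cat', 'cat', 'cat sat', 'sat', 'sat on',
#  'on', 'on the', 'the', 'the mat', 'mat'}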
One example aside from the mentioned TensorFlow/NLTK implementations is scikit-learn's TfidfVectorizer, which has the parameter ngram_range. This sits somewhere between Chollet's definition and the strict interpretation: you can select an arbitrary minimum/maximum size of "grams" for a single unit, and the features are then built similarly to the example above, where a single bag can contain both unigrams and bigrams, for example.
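As a rough sketch of that behavior (assuming a recent scikit-learn, which lowercases by default and strips punctuation via its default tokenizer):

from sklearn.feature_extraction.text import TfidfVectorizer

# ngram_range=(1, 2) puts unigrams and bigrams into one vocabulary.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(["The cat sat on the mat."])
print(vectorizer.get_feature_names_out())
# ['cat' 'cat sat' 'mat' 'on' 'on the' 'sat' 'sat on' 'the' 'the cat' 'the mat']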
Upvotes: 2
Reputation: 1882
The book's implementation is incorrect. It is mixing unigrams (1-grams) with bigrams (2-grams).
Upvotes: 0