Reputation: 5355
Say I have the sentences ["hello", "foo bar baz"] and I want the 1-, 2- and 3-grams, but I only want to keep a 1- or 2-gram if it is not contained in a 3-gram. That is, for the two sentences above I would like the vocabulary to be [("hello"), ("foo bar baz")].
If I use CountVectorizer with ngram_range=(1,3), I also get the unigrams foo, bar and baz, and their bigrams as well. On the other hand, I can't just set ngram_range=(3,3), since "hello" has no 3-gram and would be dropped.
Is there a way of doing this without a serious workaround?
Upvotes: 1
Views: 164
Reputation: 1761
Unfortunately, scikit-learn does not provide a straightforward way of filtering n-grams like this. Here's a simple approach using nltk that keeps only the n-grams that occur exactly once in the corpus:
from collections import Counter

from nltk import ngrams


def unique_ngrams(texts, n_range):
    all_ngrams = []
    for n in range(n_range[0], n_range[1] + 1):
        for text in texts:
            tokens = text.split()
            all_ngrams.extend(ngrams(tokens, n))
    # Count the occurrences of each n-gram across all texts
    ngram_counts = Counter(all_ngrams)
    # Keep only the n-grams that occur exactly once
    return [ngram for ngram, count in ngram_counts.items() if count == 1]


texts = ["hello", "foo bar baz", "baz bar foo", "foo bar"]
print(unique_ngrams(texts, (1, 3)))
With this code, we first generate all n-grams within the specified range for each text. We then count the occurrences of each n-gram across all texts. Finally, we keep only the n-grams that occur exactly once. Note that this is a frequency filter rather than a containment check: an n-gram that occurs once is kept even if it is part of a longer n-gram, as ('bar', 'baz') is in the output below.
Output:
[('hello',), ('bar', 'baz'), ('baz', 'bar'), ('bar', 'foo'), ('foo', 'bar', 'baz'), ('baz', 'bar', 'foo')]
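If the goal is to drop every n-gram that is contained in a longer one, as described in the question, a minimal sketch could look like the following. It uses plain Python rather than nltk or scikit-learn, assumes whitespace tokenization, and the helper names (ngrams_of, is_contained, maximal_ngrams) are hypothetical ones chosen for illustration.

```python
def ngrams_of(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def is_contained(short, long):
    """True if `short` appears as a contiguous run inside `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def maximal_ngrams(texts, n_range=(1, 3)):
    """Keep only n-grams that are not part of a longer collected n-gram."""
    lo, hi = n_range
    # dict keys give an ordered, de-duplicated collection of n-grams
    grams = {}
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        for n in range(lo, hi + 1):
            for gram in ngrams_of(tokens, n):
                grams[gram] = None
    return [
        g for g in grams
        if not any(len(h) > len(g) and is_contained(g, h) for h in grams)
    ]


print(maximal_ngrams(["hello", "foo bar baz"]))
# [('hello',), ('foo', 'bar', 'baz')]
```

The pairwise containment check is quadratic in the number of distinct n-grams, so this is only reasonable for small vocabularies. If you still need a count matrix, joining each tuple with spaces and passing the result as the vocabulary argument of CountVectorizer (together with ngram_range=(1,3)) should restrict the features to these n-grams.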
Upvotes: 0