CutePoison

Reputation: 5355

Remove the 1- and 2-grams from CountVectorizer that are contained in a 3-gram

Say I have the sentences ["hello", "foo bar baz"] and I want the 1-, 2- and 3-grams, but only those 1- and 2-grams that are not contained in a 3-gram. That is, for the two sentences above I would like the vocabulary to be [("hello"), ("foo bar baz")].

If I use CountVectorizer with ngram_range=(1,3), I also get the unigrams foo, bar and baz and their bigrams. I can't just set ngram_range=(3,3) either, because that would drop "hello".

Is there a way to do this without a serious workaround?

Upvotes: 1

Views: 164

Answers (1)

DataJanitor

Reputation: 1761

Unfortunately, scikit-learn does not provide a straightforward way to generate unique n-grams. Here's a simple approach using nltk:

from nltk import ngrams
from collections import Counter

def unique_ngrams(texts, n_range):
    all_ngrams = []
    for n in range(n_range[0], n_range[1]+1):
        for text in texts:
            tokens = text.split()
            grams = list(ngrams(tokens, n))
            all_ngrams.extend(grams)

    # Count the occurrences of each ngram
    ngram_counts = Counter(all_ngrams)
    
    # Keep only the ngrams that occur once (are unique)
    # Stored as `result` so the local name does not shadow the function itself
    result = [ngram for ngram, count in ngram_counts.items() if count == 1]

    return result

texts = ["hello", "foo bar baz", "baz bar foo", "foo bar"]
print(unique_ngrams(texts, (1, 3)))

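With this code, we first generate all n-grams within the specified range for each text. We then count the occurrences of each n-gram across all texts. Finally, we keep only the n-grams that occur exactly once. Note that this is a frequency filter, not a containment check: an n-gram that occurs once is kept even when it is part of a longer n-gram (('bar', 'baz') in the output is part of ('foo', 'bar', 'baz')), and an n-gram that occurs more than once is dropped even when it is not part of a longer one.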

Output:

[('hello',), ('bar', 'baz'), ('baz', 'bar'), ('bar', 'foo'), ('foo', 'bar', 'baz'), ('baz', 'bar', 'foo')]
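
To filter on containment rather than on frequency, as the question asks, something like the following minimal sketch might work. It is plain Python (no nltk), assumes whitespace tokenization as in the code above, and treats containment corpus-wide, so an n-gram is dropped if it appears inside a longer n-gram from any text. The names ngrams_of and maximal_ngrams are made up for this example.

```python
def ngrams_of(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def maximal_ngrams(texts, n_range=(1, 3)):
    """Collect n-grams in n_range, then drop any contained in a longer one."""
    lo, hi = n_range
    vocab = set()
    for text in texts:
        tokens = text.split()  # assumption: whitespace tokenization
        for n in range(lo, hi + 1):
            vocab.update(ngrams_of(tokens, n))

    # Every contiguous proper sub-sequence of a collected n-gram is "covered".
    covered = set()
    for gram in vocab:
        for n in range(1, len(gram)):
            covered.update(ngrams_of(gram, n))

    # Sort by length, then alphabetically, for a stable result.
    return sorted(vocab - covered, key=lambda g: (len(g), g))


texts = ["hello", "foo bar baz"]
vocabulary = [" ".join(gram) for gram in maximal_ngrams(texts)]
print(vocabulary)  # ['hello', 'foo bar baz']
```

If counts are still needed, the resulting list could probably be passed to CountVectorizer through its vocabulary parameter together with ngram_range=(1, 3). Keep in mind that CountVectorizer's default tokenization lowercases the text and skips single-character tokens, so it may not match a plain split().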

Upvotes: 0
