Reputation: 1
I need to extract the top X least common words with CountVectorizer, but I was not able to find a way to do it.
I'm using multiple CountVectorizers in a FeatureUnion:
union = FeatureUnion([('words', CountVectorizer(ngram_range=(1, 3), analyzer='word', max_features=200)),
                      ('chars', CountVectorizer(ngram_range=(1, 4), analyzer='char', max_features=200))])
X_train = union.fit_transform(train_texts)
X_test = union.transform(test_texts)
I would need to reverse the order somehow to make CountVectorizer return the least common words. Is there a way to do it? I basically need the 200 least common n-grams from both the word and char n-grams.
Upvotes: 0
Views: 108
Reputation: 11922
Here's an IPython demonstration of how you can determine the least common occurrences of the specified ngrams. Comments in the code describe the methodology.
$ ipython
Python 3.10.9 (main, Dec 7 2022, 00:00:00) [GCC 12.2.1 20221121 (Red Hat 12.2.1-4)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.9.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from faker import Faker
In [2]: faker = Faker()
In [3]: corpus = faker.sentences(10000)
In [4]: corpus[:5] # first 5 sentences of nonsense
Out[4]:
['Drug road condition space dog after key.',
'Piece myself music society.',
'Assume gas evening cut majority own.',
'This both part we.',
'Far life summer those line nature.']
In [5]: from sklearn.pipeline import FeatureUnion
In [6]: from sklearn.feature_extraction.text import CountVectorizer
In [7]: union = FeatureUnion([('words', CountVectorizer(ngram_range=(1, 3), analyzer='word')),
...: ('chars', CountVectorizer(ngram_range=(1, 4), analyzer='char'))])
In [8]: X = union.fit_transform(corpus) # matrix of ngram counts, X.shape == (10000, 91734)
In [9]: ngram_counts = X.sum(axis=0).A1 # vector of counts over all sentences, shape == (91734,)
In [10]: ngram_count_sort_indices = ngram_counts.argsort() # get indices of sort
In [11]: union.get_feature_names_out()[ngram_count_sort_indices[:20]] # show first 20 least common ngrams - change to whatever is needed
For the faker nonsense sentences, here are the 20 least common ngrams (predictably, they all occur exactly once). The slice in the line of code above ([:20]) can easily be changed to whatever number you need (e.g. [:200]).
Out[11]:
array(['words__office where candidate', 'words__protect doctor',
'words__protect do poor', 'words__protect do',
'words__protect dark according', 'words__protect dark',
'words__protect create someone', 'words__protect create',
'words__protect church', 'words__protect charge surface',
'words__protect charge', 'words__protect chance ever',
'words__protect chance', 'words__protect can air',
'words__protect can', 'words__protect author',
'words__protect doctor long', 'words__protect ago',
'words__protect drop', 'words__protect factor'], dtype=object)
If you want to strip the pipeline label from the ngrams and keep only the word/char ngrams themselves, you could:
In [12]: least_common = union.get_feature_names_out()[ngram_count_sort_indices[:20]]
In [13]: [x.split("__")[1] for x in least_common]
Out[13]:
['office where candidate',
'protect doctor',
'protect do poor',
'protect do',
'protect dark according',
'protect dark',
'protect create someone',
'protect create',
'protect church',
'protect charge surface',
'protect charge',
'protect chance ever',
'protect chance',
'protect can air',
'protect can',
'protect author',
'protect doctor long',
'protect ago',
'protect drop',
'protect factor']
Upvotes: -1