Ameera

Reputation: 33

get unmatched words after CountVectorizer transform

I am using count vectorizer to apply string matching in a large dataset of texts. What I want is to get the words that do not match any term in the resulting matrix. For example, if the resulting terms (features) after fitting are:

{'hello world', 'world and', 'and stackoverflow', 'hello', 'world', 'stackoverflow', 'and'}

and I transformed this text:

"oh hello world and stackoverflow this is a great morning"

I want to get the string oh this is a great morning since it matches nothing in the features. Is there an efficient method to do this?

I tried using inverse_transform method to get the features and remove them from the text but I ran into many problems and long running time.

Upvotes: 1

Views: 1876

Answers (1)

tttthomasssss

Reputation: 5971

Transforming a piece of text on the basis of a fitted vocabulary is going to return you a matrix with counts of the known vocabulary.

For example, if your input document is as in your example:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(ngram_range=(1, 2))

docs = ['hello world and stackoverflow']
vec.fit(docs)

Then the fitted vocabulary might look as follows:

In [522]: print(vec.vocabulary_)
{'hello': 2, 
 'world': 5, 
 'and': 0, 
 'stackoverflow': 4, 
 'hello world': 3, 
 'world and': 6, 
 'and stackoverflow': 1}

This is a token-to-index mapping. Transforming new documents subsequently returns a matrix with counts of all known vocabulary tokens. Words that are not in the vocabulary are ignored!

other_docs = ['hello stackoverflow', 
              'hello and hello', 
              'oh hello world and stackoverflow this is a great morning']

X = vec.transform(other_docs)

In [523]: print(X.A)
[[0 0 1 0 1 0 0]
 [1 0 2 0 0 0 0]
 [1 1 1 1 1 1 1]]

Your vocabulary consists of 7 items, hence the matrix X contains 7 columns, and we've transformed 3 documents, so it's a 3x7 matrix. Each element is a count of how often the particular token occurred in that document. For example, for the second document "hello and hello", we have a count of 2 in column 2 (0-indexed) and a count of 1 in column 0, which refer to "hello" and "and", respectively.
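The shapes and counts above can be verified programmatically; a minimal sketch repeating the fit/transform from the example:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on the single training document from the example above
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(['hello world and stackoverflow'])

other_docs = ['hello stackoverflow',
              'hello and hello',
              'oh hello world and stackoverflow this is a great morning']
X = vec.transform(other_docs)

# 3 documents x 7 vocabulary items
print(X.shape)                  # (3, 7)

# 'hello and hello': column 2 ('hello') counts 2, column 0 ('and') counts 1
print(X.toarray()[1].tolist())  # [1, 0, 2, 0, 0, 0, 0]
```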

inverse_transform then maps a count matrix back to the vocabulary items that have nonzero counts:

In [534]: print(vec.inverse_transform([1, 2, 3, 4, 5, 6, 7]))
[array(['and', 'and stackoverflow', 'hello', 'hello world',
   'stackoverflow', 'world', 'world and'], dtype='<U17')]

Note: The input here is a row of counts (one per vocabulary index), not a list of indices; since all 7 counts are nonzero, every vocabulary item is returned, sorted by its index.
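In practice you'd usually call inverse_transform on a transformed matrix directly, which gives you, per document, the vocabulary items that actually occurred; a short sketch continuing the example:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(['hello world and stackoverflow'])

X = vec.transform(['hello stackoverflow', 'hello and hello'])

# One array per document, containing the in-vocabulary tokens it matched
for doc_terms in vec.inverse_transform(X):
    print(sorted(doc_terms))
# ['hello', 'stackoverflow']
# ['and', 'hello']
```

Note that this only tells you which known terms matched; it says nothing about the out-of-vocabulary words, which is why the set-difference approach is needed.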

Now let's get to your actual question, which is identifying all out-of-vocabulary (OOV) items in a given input document. It's fairly straightforward using sets if you're only interested in unigrams:

tokens = 'oh hello world and stackoverflow this is a great morning'.split()
In [542]: print(set(tokens) - set(vec.vocabulary_.keys()))
{'morning', 'a', 'is', 'this', 'oh', 'great'}

Things are slightly more involved if you're also interested in bigrams (or any other n-gram where n > 1), as first you'd need to generate all bigrams from your input document (note there are various ways to generate all ngrams from an input document of which the following is just one):

bigrams = list(map(lambda x: ' '.join(x), zip(tokens, tokens[1:])))
In [546]: print(bigrams)
['oh hello', 'hello world', 'world and', 'and stackoverflow', 'stackoverflow this', 'this is', 'is a', 'a great', 'great morning']

This line looks fancy, but all it does is zip two lists together (the second starting at the second item), which yields tuples such as ('oh', 'hello'); the map then joins each tuple with a single space, turning ('oh', 'hello') into 'oh hello', and the resulting generator is converted into a list. Now you can build the union of unigrams and bigrams:

doc_vocab = set(tokens) | set(bigrams)
In [549]: print(doc_vocab)
{'and stackoverflow', 'hello', 'a', 'morning', 'hello world', 'great morning', 'world', 'stackoverflow', 'stackoverflow this', 'is', 'world and', 'oh hello', 'oh', 'this', 'is a', 'this is', 'and', 'a great', 'great'}

Now you can do the same as with unigrams above to retrieve all OOV items:

In [550]: print(doc_vocab - set(vec.vocabulary_.keys()))
{'morning', 'a', 'great morning', 'stackoverflow this', 'is a', 'is', 'oh hello', 'this', 'this is', 'oh', 'a great', 'great'}

This now represents all unigrams and bigrams that are not in your vectorizer's vocabulary.
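As an alternative to generating the n-grams by hand, you can reuse the vectorizer's own analyzer via build_analyzer(), which applies exactly the tokenization and ngram_range the vectorizer itself uses. One caveat: the default token pattern drops single-character tokens such as 'a', so the result can differ slightly from a plain split(). A sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

def oov_terms(vec, doc):
    """Return all n-grams of `doc` (as produced by the vectorizer's
    own analyzer) that are not in its fitted vocabulary."""
    analyzer = vec.build_analyzer()  # same tokenization + ngram_range as vec
    return set(analyzer(doc)) - set(vec.vocabulary_)

vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(['hello world and stackoverflow'])

doc = 'oh hello world and stackoverflow this is a great morning'
print(sorted(oov_terms(vec, doc)))
```

Here 'a' (and the bigrams containing it) are absent from the output, because the analyzer's default token pattern only keeps tokens of two or more characters.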

Upvotes: 4
