buydadip

Reputation: 9427

How to increase weight of a word for CountVectorizer

I have a tokenized document, and I want to compare it with another document by calculating their cosine similarity.

Before I calculate the similarity, I want to increase the weight of one of the words. I'm thinking of doing this by doubling the count of that word, but I do not know how to do that.

Suppose I have the following...

text = [
    "This is a test",
    "This is something else",
    "This is also a test"
]

test = ["This is something"]

Next I define the stop words and call CountVectorizer for both sets of documents.

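from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Pass the stop words as a list; newer scikit-learn versions may reject a set
stopWords = list(stopwords.words('english'))

vectorizer = CountVectorizer(stop_words=stopWords)

trainVectorizerArray = vectorizer.fit_transform(text).toarray()
testVectorizerArray = vectorizer.transform(test).toarray()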

In the next part I calculate the cosine similarity...

import numpy as np
from numpy import linalg as LA

cosine_function = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)

for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cosine_function(vector, testV)
        print(cosine)

Before calculating the similarity, how can I increase the weight of one of the words? Suppose that in this example I want to increase the weight of "something". I think this is done by increasing the word's count, but I do not know how to change the count.

Upvotes: 4

Views: 2748

Answers (1)

piman314

Reputation: 5355

I think the easiest way is to combine the feature names of your CountVectorizer (get_feature_names, or get_feature_names_out in newer scikit-learn releases) with the cosine function in scipy.spatial.distance, which accepts a per-feature weight vector w. Be aware that this function computes the cosine distance rather than the similarity, so if you are interested in the similarity you must use similarity = 1 - distance. Using your example:

from scipy.spatial.distance import cosine
import numpy as np

word_weights = {'something': 2}
# get_feature_names() was removed in scikit-learn 1.2; on older versions call it instead
feature_names = list(vectorizer.get_feature_names_out())
weights = np.ones(len(feature_names))

for key, value in word_weights.items():
    weights[feature_names.index(key)] = value

for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine_unweight = cosine(vector, testV)
        cosine_weighted = cosine(vector, testV, w=weights)
        print(cosine_unweight, cosine_weighted)
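
For reference, the weighted distance can be reproduced by hand. The sketch below assumes SciPy's weighted cosine distance is 1 - sum(w*u*v) / sqrt(sum(w*u*u) * sum(w*v*v)); the helper name weighted_cosine_similarity and the sample vectors are made up for illustration and are not part of SciPy or scikit-learn.

```python
import math

def weighted_cosine_similarity(u, v, w=None):
    """Cosine similarity with optional per-feature weights (hypothetical helper)."""
    if w is None:
        w = [1.0] * len(u)
    uv = sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))
    uu = sum(wi * ui * ui for wi, ui in zip(w, u))
    vv = sum(wi * vi * vi for wi, vi in zip(w, v))
    if uu == 0 or vv == 0:
        # Assumption: treat an all-zero vector as having no similarity
        return 0.0
    return uv / math.sqrt(uu * vv)

# Illustrative count vectors over an assumed vocabulary ["else", "something", "test"]
doc = [1, 1, 0]
query = [0, 1, 0]
weights = [1, 2, 1]  # boost "something"

plain = weighted_cosine_similarity(doc, query)
boosted = weighted_cosine_similarity(doc, query, weights)
print(round(plain, 3), round(boosted, 3))  # prints 0.707 0.816
```

Under this definition, a weight w has the same effect as multiplying that word's count by sqrt(w) in both vectors, so literally doubling the count of a word in both documents corresponds to a weight of 4, not 2.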

As requested, here is a bit more explanation of the word_weights dictionary. It holds the weight you are assigning to each listed word relative to the other words. Every weight is set to 1 unless you add an entry to word_weights, so word_weights = {'test': 0} would remove "test" from the cosine similarity, while word_weights = {'test': 1.5} would increase its weighting by 50% compared to the other words. You can also include multiple entries if you need to; for example, word_weights = {'test': 1.5, 'something': 2} will adjust the weighting of both "test" and "something" compared to the other words.
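
One caveat: feature_names.index(key) raises a ValueError when a word in word_weights is not in the vocabulary, for example because it is a stop word or never appeared in the fitted text. A minimal sketch of a more forgiving version follows; build_weights and the sample vocabulary are hypothetical names chosen for illustration.

```python
def build_weights(feature_names, word_weights, default=1.0):
    """Return (weights, skipped): one weight per feature, plus words not in the vocabulary."""
    index = {name: i for i, name in enumerate(feature_names)}
    weights = [default] * len(feature_names)
    skipped = []
    for word, value in word_weights.items():
        if word in index:
            weights[index[word]] = value
        else:
            # The word is not a feature, so there is nothing to weight
            skipped.append(word)
    return weights, skipped

# Assumed vocabulary, standing in for the vectorizer's feature names
feature_names = ["also", "else", "something", "test"]
word_weights = {"something": 2, "test": 1.5, "this": 3}

weights, skipped = build_weights(feature_names, word_weights)
print(weights)  # prints [1.0, 1.0, 2, 1.5]
print(skipped)  # prints ['this']
```

The returned list can be passed as w to the cosine function in the same way as the NumPy array above.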

Upvotes: 5
