Reputation: 9427
I have a document that I tokenized, and then I take another document and I compare the two by calculating their cosine similarity.
However, before I calculate their similarity, I want to increase the weight of one of the words beforehand. I'm thinking of doing this by doubling the count of that word, but I do not know how to do that.
Suppose I have the following...
text = [
"This is a test",
"This is something else",
"This is also a test"
]
test = ["This is something"]
Next I define the stop words and call CountVectorizer for both sets of documents.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

stopWords = set(stopwords.words('english'))
vectorizer = CountVectorizer(stop_words=stopWords)
trainVectorizerArray = vectorizer.fit_transform(text).toarray()
testVectorizerArray = vectorizer.transform(test).toarray()
In the next part I calculate the cosine similarity...
import numpy as np
from numpy import linalg as LA

cosine_function = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)

for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cosine_function(vector, testV)
        print(cosine)
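As an aside, scikit-learn ships an equivalent helper, sklearn.metrics.pairwise.cosine_similarity, which computes all train/test pairs in one call. A minimal sketch with small stand-in matrices (the same counts the example vectorizer would produce):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for trainVectorizerArray / testVectorizerArray.
trainVectorizerArray = np.array([[0, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 1]])
testVectorizerArray = np.array([[0, 0, 1, 0]])

# One row per training document, one column per test document.
sims = cosine_similarity(trainVectorizerArray, testVectorizerArray)
print(sims.round(3))  # [[0.], [0.707], [0.]]
```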
However, before I calculate the similarity, how can I increase the weight of one of the words? Suppose in this example I want to increase the weight of "something"; how would I do that? I think it is done by increasing the word count, but I do not know how to increase it.
Upvotes: 4
Views: 2748
Reputation: 5355
I think the easiest way would be to use the get_feature_names_out method of your CountVectorizer in combination with the cosine function in scipy.spatial.distance. Be aware that this computes cosine distance rather than similarity, so if you are interested in similarity you must use similarity = 1 - distance. Using your example:

from scipy.spatial.distance import cosine
import numpy as np

word_weights = {'something': 2}

# use get_feature_names() on scikit-learn < 1.0
feature_names = list(vectorizer.get_feature_names_out())
weights = np.ones(len(feature_names))
for key, value in word_weights.items():
    weights[feature_names.index(key)] = value

for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine_unweight = cosine(vector, testV)
        cosine_weighted = cosine(vector, testV, w=weights)
        print(cosine_unweight, cosine_weighted)
As requested, a bit more explanation of the word_weights dictionary: it holds the weight assigned to each word. Every weight defaults to 1 unless you add an entry to word_weights, so word_weights = {'test': 0} would remove "test" from the cosine similarity entirely, while word_weights = {'test': 1.5} would increase its weighting by 50% relative to the other words. You can also include multiple entries if you need to; for example, word_weights = {'test': 1.5, 'something': 2} adjusts the weighting of both "test" and "something" relative to the other words.
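The question's original idea, doubling the raw count of the word, also works: scale the corresponding column of both count matrices before comparing. A minimal sketch with toy vectors (the column index of 'something' is assumed from the vectorizer's vocabulary). Note that doubling the counts in both vectors is not identical to passing w=2 to scipy's weighted cosine: because each weight w_i multiplies the product of coordinates, doubling the counts corresponds to a weight of 4.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Toy count vectors; columns assumed to be ['also', 'else', 'something', 'test'].
train = np.array([[0, 1, 1, 0]], dtype=float)  # "something else"
test = np.array([0, 0, 1, 0], dtype=float)     # "something"

idx = 2             # assumed column index of 'something'
train[:, idx] *= 2  # double its count in every training document
test[idx] *= 2      # and in the test document

similarity = 1 - cosine(train[0], test)  # scipy gives distance, so invert
print(round(similarity, 3))
```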
Upvotes: 5