Rookie

Reputation: 5487

How do I use Cosine similarity for this use case?

If I have a query vector A and an item vector B, it would be great if someone could guide me on how to weight/normalize the vectors (and what strategies exist for doing so). Vector A would have the following components: property1 (binary), property2 (binary), property3 (int in the range 0 to 50), property4 (int in the range 0 to 10).

Vector B would have the same properties

I know that cosine similarity uses the angle between the two vectors to measure how similar (or how far apart) they are. I want to create a recommendation based on that similarity.

But I am not clear on how to normalize the properties and/or the vectors in this case, since they mix binary values with integer ranges. Also, if I want to give higher weight to one property than another, how do I do so? What options do I have?

I can find examples of cosine similarity online that use documents, but in this case vectors A and B are not documents, so I am not using TF-IDF.

Please advise,

Thanks

Upvotes: 1

Views: 507

Answers (1)

Ian

Reputation: 861

If you want to use the traditional cosine similarity between the two vectors, as with tf-idf, then each term is a dimension in your vector. That is, you need to form two new vectors A' and B' and compute the similarity between these two.

These vectors have a dimension for each term, and you have 66 terms:

property 1: true and false
property 2: true and false
property 3: 0 through 50
property 4: 0 through 10

So A' and B' will be vectors of length 66, and each element will be either 0 or 1:

A'(0) = 1 if A(0) = true, and 0 otherwise
A'(1) = 1 if A(0) = false, and 0 otherwise
etc.

Clearly, this is inefficient. You don't actually need to calculate A' or B' to use cosine similarity as with tf-idf; you can just pretend you calculated them and perform the calculation on A and B directly (the cosine similarity is simply the number of matching properties divided by 4). Note that length(A') = length(B') = sqrt(4), because there will be exactly 4 ones in A' and B'.
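
Here is a minimal Python sketch of that (the expand/cosine helpers and the sample values are made up for illustration, assuming the 66-term layout above):

import math

def expand(item):
    """Map (property1, property2, property3, property4) to the 66-dim one-hot vector A'."""
    p1, p2, p3, p4 = item
    v = [0] * 66
    v[0 if p1 else 1] = 1    # dims 0-1:   property 1 (true, false)
    v[2 if p2 else 3] = 1    # dims 2-3:   property 2 (true, false)
    v[4 + p3] = 1            # dims 4-54:  property 3 (values 0..50)
    v[55 + p4] = 1           # dims 55-65: property 4 (values 0..10)
    return v

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

A = (True, False, 40, 7)   # sample values, purely illustrative
B = (True, True, 40, 3)

# Full calculation on the expanded vectors ...
sim_full = cosine(expand(A), expand(B))

# ... equals the shortcut: count matching properties and divide by 4,
# because length(A') = length(B') = sqrt(4) = 2.
sim_shortcut = sum(a == b for a, b in zip(A, B)) / 4

print(sim_full, sim_shortcut)   # both 0.5 here (two of the four properties match)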

tf-idf may not be your best bet, though, if you want to account for similarity within properties 3 and 4. That is, with tf-idf, a property 3 value of 40 is different from a property 3 value of 41 and different from a property 3 value of 12, but 41 is not considered any "closer" to 40 than 12 is; they are all just different terms.

So, if you want properties 3 and 4 to incorporate a notion of distance (1 is really close to 2, and 50 is far from 2), then you have to define a distance metric. And if you want to weigh the Boolean values more or less than properties 3 and 4, you will have to define your own distance metric too. If these are things you want to do, forget about cosine and just come up with a value.

Here's an example:

distance = abs(A.property1 - B.property1) * 5 + 
           abs(A.property2 - B.property2) * 5 + 
           abs(A.property3 - B.property3) / 51 * 1 +
           abs(A.property4 - B.property4) / 10 * 2

And then the similarity = (the maximum possible distance) - distance;

Or, if you like, similarity = 1 / distance.

You can really define it however you like. And if you need the similarity to be between 0 and 1, normalize by dividing by the maximum possible distance.
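
For example, a rough Python sketch of that weighted distance and the normalized similarity (items stored as plain dicts with binary properties as 0/1; all names and sample values are illustrative):

def distance(a, b):
    """Weighted distance; binary properties are stored as 0/1 so abs() works."""
    return (abs(a['property1'] - b['property1']) * 5 +
            abs(a['property2'] - b['property2']) * 5 +
            abs(a['property3'] - b['property3']) / 51 * 1 +
            abs(a['property4'] - b['property4']) / 10 * 2)

# Maximum possible distance given the ranges and weights above:
# 1*5 + 1*5 + 50/51*1 + 10/10*2
MAX_DISTANCE = 1 * 5 + 1 * 5 + 50 / 51 * 1 + 10 / 10 * 2

def similarity(a, b):
    """Similarity in [0, 1]: 1 means identical items, 0 means maximally different."""
    return 1 - distance(a, b) / MAX_DISTANCE

query = {'property1': 1, 'property2': 0, 'property3': 40, 'property4': 7}
item  = {'property1': 1, 'property2': 1, 'property3': 12, 'property4': 3}
print(similarity(query, item))   # roughly 0.51 for these sample values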

Upvotes: 1
