Reputation: 5487
I have a query vector A and an item vector B, and I would appreciate guidance on strategies for weighting and normalizing them. Vector A has the following components: property1 (binary), property2 (binary), property3 (integer in the range 0 to 50), and property4 (integer in the range 0 to 10).
Vector B has the same properties.
I know that cosine similarity, which is based on the angle between the two vectors, would tell me how close they are. I want to create recommendations based on that similarity.
But I am not clear on how to normalize the properties and/or the vectors in this case, since the components are a mix of binary, binary, integer range, and integer range. Also, if I want to give one property more weight than another, how do I do so? What options do I have?
I find examples of cosine similarity online that use documents, but in this case vectors A and B are not documents, so I am not using TF-IDF.
Please advise,
Thanks
Upvotes: 1
Views: 507
Reputation: 861
If you want to use the traditional cosine similarity between the two vectors as is done with tf-idf, then each term is a dimension in your vector. That is, you need to form two new vectors A' and B' and compute the similarity between these two.
These vectors have a dimension for each term, and you have 66 terms (2 + 2 + 51 + 11):
property 1: true and false
property 2: true and false
property 3: 0 through 50
property 4: 0 through 10
So A' and B' will be vectors of length 66 and each element will be either 0 or 1:
A'(0) = 1 if A(0) = true, and 0 otherwise
A'(1) = 1 if A(0) = false, and 0 otherwise
etc.
Clearly, this is inefficient. You don't actually need to build A' or B' to use this tf-idf-style cosine similarity; you can pretend you built them and perform the calculation directly on A and B. Note that the Euclidean norms satisfy |A'| = |B'| = sqrt(4) = 2, because A' and B' each contain exactly 4 ones (one per property), so the cosine is simply the number of properties on which A and B agree, divided by 4.
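To make the shortcut concrete, here is a minimal Python sketch. The names `one_hot`, `cosine`, and `cosine_shortcut` are made up for illustration, and an item is assumed to be a tuple `(bool, bool, int, int)` with the value ranges given in the question. It builds A' and B' explicitly only to check that counting matching properties gives the same number.

```python
import math

# Number of distinct values per property (assumed from the question):
# two booleans, an int in 0..50, an int in 0..10 -> 66 dimensions in total.
SIZES = [2, 2, 51, 11]

def one_hot(item):
    """Expand (bool, bool, int, int) into the 0/1 vector A'."""
    # Index 0 means "true" and index 1 means "false" for the boolean properties.
    values = [0 if item[0] else 1, 0 if item[1] else 1, item[2], item[3]]
    vec = []
    for size, value in zip(SIZES, values):
        block = [0] * size
        block[value] = 1
        vec.extend(block)
    return vec

def cosine(u, v):
    """Plain cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def cosine_shortcut(a, b):
    """Same value without building A' and B': matching properties / 4."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

a = (True, False, 40, 7)
b = (True, True, 40, 3)
print(cosine(one_hot(a), one_hot(b)))  # 0.5 (property 1 and property 3 match)
print(cosine_shortcut(a, b))           # 0.5
```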
This tf-idf-style approach may not be your best bet, though, if you want to capture similarity within properties 3 and 4. With it, a property 3 value of 40 is different from a property 3 value of 41 and different from a property 3 value of 12. However, 12 is not considered "farther away" from 40 than 41 is; they are all just different terms.
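A small sketch of that limitation, assuming the same tuple layout `(bool, bool, int, int)` and a hypothetical helper `match_similarity` that returns the fraction of properties with identical values (which is what the one-hot cosine reduces to):

```python
def match_similarity(a, b):
    """Fraction of properties with identical values (one-hot cosine)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

query = (True, False, 40, 5)
near = (True, False, 41, 5)   # property 3 differs by 1
far = (True, False, 12, 5)    # property 3 differs by 28

print(match_similarity(query, near))  # 0.75
print(match_similarity(query, far))   # 0.75
```

Both candidates get the same score, because only exact equality on property 3 counts.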
So, if you want properties 3 and 4 to incorporate a distance (1 is really close to 2 and 50 is far from 2), then you have to define a distance metric. And if you want to weigh the Boolean values more or less than properties 3 and 4, you will have to define a different distance metric too. If these are things you want to do, forget about cosine and just define your own measure.
Here's an example:
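distance = abs(A.property1 - B.property1) * 5 +
           abs(A.property2 - B.property2) * 5 +
           abs(A.property3 - B.property3) / 50 * 1 +
           abs(A.property4 - B.property4) / 10 * 2
Here the Boolean properties are treated as 0 or 1, and properties 3 and 4 are divided by their largest possible difference (50 and 10) so that each term lies between 0 and its weight.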
And then similarity = (the maximum possible distance) - distance.
Or, if you like, similarity = 1 / distance, though this is undefined when the two vectors are identical (distance 0); 1 / (1 + distance) avoids that.
You can really define it however you like. And if you need the similarity to be between 0 and 1, use 1 - distance / (the maximum possible distance).
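The weighted distance described above could be sketched in Python roughly as follows. The `Item` class, the `WEIGHTS` and `RANGES` tables, and the function names are illustrative, not part of any library. The weights are the example values 5, 5, 1, and 2, and this sketch assumes each numeric difference is divided by its largest possible value (50 and 10).

```python
from dataclasses import dataclass

@dataclass
class Item:
    property1: bool
    property2: bool
    property3: int  # assumed range 0..50
    property4: int  # assumed range 0..10

# Example weights; raise a weight to make that property matter more.
WEIGHTS = {"property1": 5.0, "property2": 5.0, "property3": 1.0, "property4": 2.0}
# Largest possible absolute difference per property (an assumption of this sketch).
RANGES = {"property1": 1, "property2": 1, "property3": 50, "property4": 10}

def distance(a: Item, b: Item) -> float:
    """Weighted sum of range-normalized absolute differences."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        diff = abs(int(getattr(a, name)) - int(getattr(b, name)))
        total += weight * diff / RANGES[name]
    return total

# With every difference scaled to 0..1, the maximum distance is the sum of weights.
MAX_DISTANCE = sum(WEIGHTS.values())

def similarity(a: Item, b: Item) -> float:
    """Similarity in [0, 1]: 1 for identical items, 0 for maximally different ones."""
    return 1.0 - distance(a, b) / MAX_DISTANCE

query = Item(True, False, 40, 7)
candidate = Item(True, True, 30, 2)
print(distance(query, candidate))
print(similarity(query, candidate))
```

To recommend items, you would compute `similarity(query, item)` for each candidate and sort in descending order; changing the entries in `WEIGHTS` is how one property gets more influence than another.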
Upvotes: 1