Reputation: 177
As an example, let's say I have a very simple data set: a CSV with three columns, user_id, book_id, and rating. The rating can be any number from 0 to 5, where 0 means the user has NOT rated the book.
Let's say I randomly pick three users, and I get these feature/rating vectors.
Martin: <3,3,5,1,2,3,2,2,5>
Jacob: <3,3,5,0,0,0,0,0,0>
Grant: <1,1,1,2,2,2,2,2,2>
The similarity calculations:
+--------------+---------+---------+----------+
| | M & J | M & G | J & G |
+--------------+---------+---------+----------+
| Euclidean | 6.85 | 5.91 | 6.92 |
+--------------+---------+---------+----------+
| Cosine | .69 | .83 | .32 |
+--------------+---------+---------+----------+
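For reference, the numbers in the table can be reproduced with a short pure-Python sketch. The function and variable names are illustrative only, and every 0 is deliberately treated as an actual rating, which is what the plain formulas do:

```python
from itertools import combinations
from math import sqrt

# Rating vectors from the question; 0 is kept as-is here.
ratings = {
    "Martin": [3, 3, 5, 1, 2, 3, 2, 2, 5],
    "Jacob":  [3, 3, 5, 0, 0, 0, 0, 0, 0],
    "Grant":  [1, 1, 1, 2, 2, 2, 2, 2, 2],
}

def euclidean(a, b):
    """Euclidean distance over every position, zeros included."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity over every position, zeros included."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Print both measures for each pair of users.
for u, v in combinations(ratings, 2):
    d = euclidean(ratings[u], ratings[v])
    s = cosine(ratings[u], ratings[v])
    print(f"{u} & {v}: euclidean={d:.2f}, cosine={s:.2f}")
```

The Euclidean values print as 6.86, 5.92 and 6.93 when rounded; the table appears to truncate them to two decimals instead.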
Now, my expectation of similarity is that Martin and Jacob would be the most similar. I would expect this because they have EXACTLY the same ratings for the books that both of them have rated. But we end up finding that Martin and Grant are the most similar.
I understand mathematically how we get to this conclusion, but I don't understand how I can rely on Cosine Angular distance or Euclidean distance as a means of calculating similarity, if this type of thing occurs. For what interpretation are Martin and Grant more similar than Martin and Jacob?
One thought I had was to just calculate Euclidean distance, but ignore every book that either of the two users hasn't rated.
I then end up with this:
+--------------+---------+---------+----------+
| | M & J | M & G | J & G |
+--------------+---------+---------+----------+
| Euclidean | 0 | 5.91 | 6.92 |
+--------------+---------+---------+----------+
| Cosine | .69 | .83 | .32 |
+--------------+---------+---------+----------+
Of course now I have a Euclidean distance of 0, which fits what I would expect of the recommender system. I see many tutorials and lectures use Cosine Angular distance to ignore the unrated books, rather than use Euclidean and ignore them, so I believe this must not work in general.
EDIT:
Just to experiment a little, I adjusted Jacob's feature vector to be much more similar:
Jacob: <3,3,5,1,2,3,2,0,0>
When I calculate Cosine Angular distance with Martin, I still only get .82! Still less similar than Martin and Grant, yet by inspection I would expect these two to be very similar.
Could somebody help explain where my thinking is wrong, and possibly suggest another similarity measure?
Upvotes: 1
Views: 1255
Reputation: 1024
Your thinking is correct, but your code might calculate the cosine similarity incorrectly.
Kris already gave you a correct answer, but I want to point out that when you calculated the cosine similarity, you didn't skip the unrated items. We can see this because the cosine similarities in the first and second tables are exactly the same. This is probably a bug in your code.
Upvotes: 0
Reputation: 5792
As you have noted yourself, Euclidean and Cosine Angular are based on distance. The distance between 3 and 5, for example, is much smaller than between 3 and 0, so with multiple zeros in Jacob's ratings you won't get much similarity between Jacob and Martin. The main problem with your example is that you assumed 0 means "no rating", whereas the two formulas interpret it as a rating of 0 (the lowest rating possible). If you skip the zero ratings and compare the users only on the ratings they have in common, then Martin and Jacob would have a similarity of 1!
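A minimal sketch of that idea, assuming 0 always marks an unrated book: keep only the positions both users have rated, then apply the usual formulas. The helper names (`co_rated`, `cosine_co_rated`, `euclidean_co_rated`) are made up for illustration and are not from any library.

```python
from math import sqrt

UNRATED = 0  # assumption from the question: 0 means "not rated", not a score

def co_rated(a, b):
    """Return the (x, y) rating pairs where both users gave a real rating."""
    return [(x, y) for x, y in zip(a, b) if x != UNRATED and y != UNRATED]

def euclidean_co_rated(a, b):
    """Euclidean distance over co-rated books only; None if no overlap."""
    pairs = co_rated(a, b)
    if not pairs:
        return None
    return sqrt(sum((x - y) ** 2 for x, y in pairs))

def cosine_co_rated(a, b):
    """Cosine similarity over co-rated books only; None if no overlap."""
    pairs = co_rated(a, b)
    if not pairs:
        return None
    dot = sum(x * y for x, y in pairs)
    norm_a = sqrt(sum(x * x for x, _ in pairs))
    norm_b = sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)

martin = [3, 3, 5, 1, 2, 3, 2, 2, 5]
jacob = [3, 3, 5, 0, 0, 0, 0, 0, 0]
grant = [1, 1, 1, 2, 2, 2, 2, 2, 2]

print(cosine_co_rated(martin, jacob))     # 1 up to floating-point rounding
print(euclidean_co_rated(martin, jacob))  # 0.0, the ratings in common are identical
print(cosine_co_rated(jacob, grant))
```

One caveat with this approach: a score built on only a few shared books can be misleading. Jacob and Grant, for instance, come out at roughly 0.97 on cosine over their three shared books, because cosine compares direction rather than the size of the ratings.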
Upvotes: 2