Reputation: 297
I've heard about K nearest neighbor to find a category that an item belongs to but I was wondering if there was an algorithm that will return a list of items based on attributes.
For example given a movie
[director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"]
The results will return
[director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"]
instead of
[director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"]
Because the former result matches more attributes "Hill Thompson" and "Will Smith" whereas the former only has one match - Hill Thompson.
Would Cosine Similarity be a good method to solve this problem?
Upvotes: 3
Views: 2203
Reputation: 3543
Would Cosine Similarity be a good method to solve this problem?
Yes. It will be good, but with TF-IDF
The most used similarity measures are Jaccard Similarity
and Cosine similarity
.
In the scenario given, you can directly use Jaccard Similarity
and obtain the results you want.
Say,
A = {director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"}
B = {director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"}
C = {director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"}
D = {director: "Foo Bar", starring-actor: "Poop Jenkins", release-date: "Some date"}
The Jaccard Similarity
will be :
J(A,B) = 2 / 4 = 0.5
J(A,C) = 1 / 5 = 0.2
J(C,D) = 1 / 5 = 0.2
And as J(A,B) > J(A,C)
the K nearest neighbour
method will pick B
first and then C
.
In such cases, the Jaccard similarity
captures the intuition well.
To demonstrate how Cosine Similarity
is better, add one more attribute :
A = {place filmed : "A", director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"}
B = {place filmed : "A", director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"}
C = {place filmed : "A", director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"}
D = {place filmed : "A", director: "Foo Bar", starring-actor: "Poop Jenkins", release-date: "Some date"}
J(A,B) = 3 / 5 = 0.6
J(A,C) = 2 / 6 = 0.33
J(C,D) = 2 / 6 = 0.33
Notice that the J(C,A) = J(C,D)
Which is wrong intuition.
Why?
Because place A seems to be common place for recording movies. Just because two movies are recorded at the same place, we can not conclude that they are similar. So ideally it should be Sim(C,D) > Sim(C,A)
. Such are the cases where Jaccard Similarity
fails to capture the intuition and where Cosine similarity
with TF-IDF
outperforms.
The problem with Cosine Similarity
in such cases is the implementation. Cosine similarity
is defined on vectors. When the data is not numeric, it is hard to create a vector.
One way to create a vector is vector of boolean
.
For example,
The vector would be formed as :
vector = [A,HillThompson,FooBar,WillSmith,Poop Jenkins,Dec 1776,Jan 1996, Sept 1822, Some date]
The vectors would be :
A = {1,1,0,1,0,1,0,0,0}
C = {1,1,0,0,1,0,0,1,0}
D = {1,0,1,0,1,0,0,0,1}
J(C,A) = 5 / 12
J(C,D) = 5 / 12
Note that the Jaccard Similarity
still captures wrong intuition. And so would Cosine Similarity
if TF-IDF is not done.
Now calculate TF-IDF :
IDF(A) = log( 1 + 4 / 4) = 0.30
IDF(HillThompson) = log( 1 + 4 / 3) = 0.37
IDF(FooBar) = log( 1 + 4 / 1) = 0.70
IDF(WillSmith) = log( 1 + 4 / 2) = 0.48
IDF(Poop Jenkins) = log( 1 + 4 / 2) = 0.48
IDF(Dec 1776) = log( 1 + 4 / 1) = 0.70
IDF(Jan 1996) = log( 1 + 4 / 1) = 0.70
IDF(Sept 1822) = log( 1 + 4 / 1) = 0.70
IDF(Some date) = log( 1 + 4 / 1) = 0.70
IF-IDF vectors would now be :
A = {0.30/4, 0.37/4, 0, 0.48/4, 0, 0.70/4, 0, 0, 0}
C = {0.30/4, 0.37/4, 0, 0, 0.48/4, 0, 0, 0.70/4, 0}
D = {0.30/4, 0, 0.70/4, 0, 0.48/4, 0, 0, 0, 0.70/4}
A = {0.075, 0.0925, 0, 0.12, 0, 0.175, 0, 0, 0 }
C = {0.075, 0.0925, 0, 0, 0.12, 0, 0, 0.175, 0 }
D = {0.075, 0, 0.175, 0, 0.12, 0, 0, 0, 0.175 }
|A| = 0.2433
|C| = 0.2433
|D| = 0.2850
Calculate the cosine similarity :
Cosine(A,C) = 0.01418 / ( 0.2433 * 0.2433 ) = 0.2395
Cosine(C,D) = 0.0200 / ( 0.2492 * 0.2850 ) = 0.2816
Thus, the Cosine similarity
with TF-IDF
captures the intuition that D
is more similar to C
than A
is to C
. And thus it is better than Jaccard similarity
PLEASE NOTE that I have shown calculations becuase I have done them on PC and not on scientific calculator. There might be chances of errors. Just in case you find one, please rectify it.
Upvotes: 4