
Reputation: 297

Finding similar products by attributes

I've heard about K nearest neighbors for finding the category an item belongs to, but I was wondering if there is an algorithm that returns a list of similar items based on their attributes.

For example given a movie

[director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"]

The results will return

[director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"]

instead of

[director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"]

Because the former result matches more attributes ("Hill Thompson" and "Will Smith"), whereas the latter has only one match, "Hill Thompson".

Would Cosine Similarity be a good method to solve this problem?

Upvotes: 3

Views: 2203

Answers (1)

Anand Undavia

Reputation: 3543

Would Cosine Similarity be a good method to solve this problem?

Yes, it will work well, but you should combine it with TF-IDF.

The most widely used similarity measures are Jaccard similarity and Cosine similarity. In the scenario you describe, you can use Jaccard similarity directly and get the results you want.

Say,

A = {director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"}
B = {director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"}
C = {director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"}
D = {director: "Foo Bar", starring-actor: "Poop Jenkins", release-date: "Some date"}

The Jaccard Similarity will be :

J(A,B) = 2 / 4 = 0.5
J(A,C) = 1 / 5 = 0.2
J(C,D) = 1 / 5 = 0.2

Since J(A,B) > J(A,C), the K nearest neighbour method will pick B before C. In cases like this, Jaccard similarity captures the intuition well.
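Treating each movie as a set of attribute values, the Jaccard computation above can be sketched in plain Python (the values are the ones from the example; nothing here is library-specific):

```python
def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = {"Hill Thompson", "Will Smith", "Dec 1776"}
B = {"Hill Thompson", "Will Smith", "Jan 1996"}
C = {"Hill Thompson", "Poop Jenkins", "Sept 1822"}

jaccard(A, B)  # 2 / 4 = 0.5
jaccard(A, C)  # 1 / 5 = 0.2
```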

To demonstrate how Cosine Similarity is better, add one more attribute :

A = {place filmed : "A", director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Dec 1776"}
B = {place filmed : "A", director: "Hill Thompson", starring-actor: "Will Smith", release-date: "Jan 1996"}
C = {place filmed : "A", director: "Hill Thompson", starring-actor: "Poop Jenkins", release-date: "Sept 1822"}
D = {place filmed : "A", director: "Foo Bar", starring-actor: "Poop Jenkins", release-date: "Some date"}


J(A,B) = 3 / 5 = 0.6
J(A,C) = 2 / 6 = 0.33
J(C,D) = 2 / 6 = 0.33

Notice that J(C,A) = J(C,D), which is the wrong intuition.

Why? Because place A seems to be a common place for filming movies. Just because two movies were filmed at the same place, we cannot conclude that they are similar. So ideally it should be Sim(C,D) > Sim(C,A), since the Poop Jenkins match (shared by only 2 movies) is more informative than the Hill Thompson match (shared by 3). These are the cases where Jaccard similarity fails to capture the intuition, and where Cosine similarity with TF-IDF outperforms it.

The problem with Cosine similarity in such cases is the implementation: it is defined on vectors, and when the data is not numeric it is not obvious how to build one.

One way is to build a boolean (one-hot) vector.

For example, the vector over the full vocabulary would be formed as:

vector = [A, HillThompson, FooBar, WillSmith, PoopJenkins, Dec1776, Jan1996, Sept1822, SomeDate]

The vectors would be :

A = {1,1,0,1,0,1,0,0,0}
C = {1,1,0,0,1,0,0,1,0}
D = {1,0,1,0,1,0,0,0,1}

J(C,A) = 2 / 6 = 0.33
J(C,D) = 2 / 6 = 0.33

Note that Jaccard similarity still gives the tied, wrong result. And so would Cosine similarity if TF-IDF is not applied.
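A minimal sketch of those one-hot vectors, with Jaccard computed from the set bits (intersection of ones over union of ones), showing the tie:

```python
# Vocabulary in the same order as the vector definition above.
vocab = ["place A", "Hill Thompson", "Foo Bar", "Will Smith",
         "Poop Jenkins", "Dec 1776", "Jan 1996", "Sept 1822", "Some date"]

def one_hot(terms):
    """Boolean vector: 1 where the movie has that attribute value."""
    return [1 if t in terms else 0 for t in vocab]

A = one_hot({"place A", "Hill Thompson", "Will Smith", "Dec 1776"})
C = one_hot({"place A", "Hill Thompson", "Poop Jenkins", "Sept 1822"})
D = one_hot({"place A", "Foo Bar", "Poop Jenkins", "Some date"})

def jaccard_bits(u, v):
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either

jaccard_bits(C, A)  # 2 / 6
jaccard_bits(C, D)  # 2 / 6 — the tie remains
```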

Now calculate the TF-IDF, using IDF(t) = log10(1 + N / df(t)) with N = 4 documents:

IDF(A)              = log( 1 + 4 / 4) = 0.30

IDF(HillThompson)   = log( 1 + 4 / 3) = 0.37
IDF(FooBar)         = log( 1 + 4 / 1) = 0.70

IDF(WillSmith)      = log( 1 + 4 / 2) = 0.48
IDF(Poop Jenkins)   = log( 1 + 4 / 2) = 0.48

IDF(Dec 1776)       = log( 1 + 4 / 1) = 0.70
IDF(Jan 1996)       = log( 1 + 4 / 1) = 0.70
IDF(Sept 1822)      = log( 1 + 4 / 1) = 0.70
IDF(Some date)      = log( 1 + 4 / 1) = 0.70

The TF-IDF vectors would now be (TF = 1/4, since each movie has 4 attributes):

A = {0.30/4, 0.37/4, 0,      0.48/4, 0,       0.70/4, 0, 0,      0}
C = {0.30/4, 0.37/4, 0,      0,      0.48/4,  0,      0, 0.70/4, 0}
D = {0.30/4,      0, 0.70/4, 0,      0.48/4,  0,      0, 0,      0.70/4}

A = {0.075,  0.0925, 0,      0.12,   0,       0.175,  0, 0,     0 } 
C = {0.075,  0.0925, 0,      0,      0.12,    0,      0, 0.175, 0 }
D = {0.075,  0,      0.175,  0,      0.12,    0,      0, 0,     0.175 }

|A| = 0.2433
|C| = 0.2433
|D| = 0.2850

Calculate the cosine similarity :

Cosine(A,C) = 0.01418 / ( 0.2433 * 0.2433 ) = 0.2395
Cosine(C,D) = 0.0200  / ( 0.2433 * 0.2850 ) = 0.2885

Thus, Cosine similarity with TF-IDF captures the intuition that D is more similar to C than A is, and so it is better than Jaccard similarity here.
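The whole pipeline above can be sketched end to end. This uses unrounded IDFs, so the results differ slightly from the hand-rounded numbers, but the ordering is the same:

```python
import math

# The four movies from the example, each as a list of attribute values.
docs = {
    "A": ["place A", "Hill Thompson", "Will Smith", "Dec 1776"],
    "B": ["place A", "Hill Thompson", "Will Smith", "Jan 1996"],
    "C": ["place A", "Hill Thompson", "Poop Jenkins", "Sept 1822"],
    "D": ["place A", "Foo Bar", "Poop Jenkins", "Some date"],
}
N = len(docs)

# Document frequency of each attribute value.
df = {}
for terms in docs.values():
    for t in terms:
        df[t] = df.get(t, 0) + 1

def tfidf(terms):
    # TF = 1/len(terms); IDF = log10(1 + N/df) as in the worked numbers.
    return {t: (1 / len(terms)) * math.log10(1 + N / df[t]) for t in terms}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

vecs = {name: tfidf(terms) for name, terms in docs.items()}
print(cosine(vecs["A"], vecs["C"]))  # ~0.24
print(cosine(vecs["C"], vecs["D"]))  # ~0.29 -> D is closer to C than A is
```

In practice you would not hand-roll this; libraries such as scikit-learn provide TF-IDF vectorization and cosine similarity out of the box, and the log base and smoothing they use may differ from the formula here.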

PLEASE NOTE that I did these calculations by hand rather than with a scientific calculator, so there may be errors. If you find one, please correct it.

Upvotes: 4
