Dheya Majid

Reputation: 423

Mathematical method for multiple document clustering by Cosine Similarity

Cosine similarity is often used when comparing two documents against each other. It measures the cosine of the angle between the two vectors. If the value is 0, the angle between the vectors is 90 degrees and the documents share no terms. If the value is 1, the two vectors are identical except for magnitude. Cosine is used when the data is sparse and asymmetric, and when jointly missing terms should not count towards similarity.

When I compute the cosine for two vectors (documents), say the two documents in the following table:

Term      Doc1 (TF)  Doc2 (TF)
London        5          3
Is            2          2
Nice         10          3
City          0          1

then I normalize the vectors and in the end I get the cosine: Cos(v1,v2) = 90%.
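For example, here is a rough Python sketch of what I am doing for two documents (the variable names are just for illustration); it reproduces the 90% from the table above:

    from math import sqrt

    # Term-frequency vectors from the table above (terms: London, Is, Nice, City).
    v1 = [5, 2, 10, 0]
    v2 = [3, 2, 3, 1]

    dot = sum(a * b for a, b in zip(v1, v2))   # 5*3 + 2*2 + 10*3 + 0*1 = 49
    norm1 = sqrt(sum(a * a for a in v1))       # sqrt(129)
    norm2 = sqrt(sum(b * b for b in v2))       # sqrt(23)

    print(dot / (norm1 * norm2))               # ~0.90, i.e. about 90%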

But if I have 10 documents, that means I have to compute

Cos(v1,v2)= ?
Cos(v1,v3)= ?
Cos(v1,v4)= ?
Cos(v1,v5)= ?
Cos(v1,v6)= ?
Cos(v1,v7)= ?
Cos(v1,v8)= ?
Cos(v1,v9)= ?
Cos(v1,v10)= ?
Cos(v2,v3)= ?
Cos(v2,v4)= ?
Cos(v2,v5)= ?

And so on

Until 

Cos(v9,v10)= ?

Then I have to compare the results.

Is there any fast method? How can I get the cosine for 10 or more documents?

I know how to get the cosine for two documents, but how can I do it for more documents? I want the mathematical method.
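To make it concrete, this is a rough sketch of the brute-force enumeration I am describing (the helper names cosine/all_pairs are just for illustration); for n documents it produces n*(n-1)/2 similarities:

    from itertools import combinations

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb)

    def all_pairs(docs):
        # 10 documents give 45 pairs: Cos(v1,v2), Cos(v1,v3), ..., Cos(v9,v10).
        return {(i + 1, j + 1): cosine(docs[i], docs[j])
                for i, j in combinations(range(len(docs)), 2)}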

Upvotes: 3

Views: 2342

Answers (2)

Erich Schubert

Reputation: 8725

There is a pretty tricky optimization that is easy to overlook.

Most of the time, your vectors will be sparse. If you look at the formula of cosine similarity, note that the vector lengths won't be changing. So you can precompute them.

For the dot product, note that if your vectors are non-zero in 10% of the dimensions, both will be non-zero in just 1% of the dimensions. So you only want to be computing the products in dimensions that are non-zero! In your example, you want to skip the word City, for example.
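A minimal Python sketch of those two points, assuming a dict-of-term-frequencies as the sparse representation (the function names are made up for illustration):

    from math import sqrt

    def norm(doc):
        # The vector length never changes, so compute it once per document.
        return sqrt(sum(tf * tf for tf in doc.values()))

    def sparse_cosine(d1, d2, n1, n2):
        # Multiply only terms that appear in BOTH documents; a word like "City"
        # that one document lacks contributes nothing and is never touched.
        if len(d2) < len(d1):
            d1, d2 = d2, d1
        dot = sum(tf * d2[term] for term, tf in d1.items() if term in d2)
        return dot / (n1 * n2)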

If you then transpose the data into a column-based layout and drop the zeros there, you can compute this in a distributed manner quite quickly. For example, the document v1 would be missing in the City column. Then you compute the pairwise products in each column, and output them to the corresponding document pair. In the end, you normalize these sums with the total vector lengths. Mahout should be doing it this way already, so you should find details on this approach in a good book on Mahout (sorry, I don't have good pointers).

The key idea for processing large quantities of text is to exploit sparsity. Try to avoid any computation whose value will be 0 anyway.
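Here is a single-machine Python sketch of the column-based idea (this is not Mahout's implementation, just an illustration of the approach under the same dict-based representation as above):

    from collections import defaultdict
    from math import sqrt

    def pairwise_cosine(docs):
        # docs: a list of sparse term-frequency dicts, e.g. {"London": 5, "Is": 2, ...}
        # Vector lengths never change, so precompute them once.
        norms = [sqrt(sum(tf * tf for tf in d.values())) for d in docs]

        # Transpose into a column-based layout: term -> list of (doc_id, tf),
        # keeping only non-zero entries (v1 simply never appears under "City").
        columns = defaultdict(list)
        for doc_id, d in enumerate(docs):
            for term, tf in d.items():
                columns[term].append((doc_id, tf))

        # In each column, emit the pairwise products to the corresponding document pair.
        sums = defaultdict(float)
        for postings in columns.values():
            for a in range(len(postings)):
                i, tf_i = postings[a]
                for b in range(a + 1, len(postings)):
                    j, tf_j = postings[b]
                    sums[(i, j)] += tf_i * tf_j

        # Finally normalize each accumulated sum by the two vector lengths.
        return {(i, j): s / (norms[i] * norms[j]) for (i, j), s in sums.items()}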

Upvotes: 1

Programmer

Reputation: 6753

Here are some optimizations. Since the matrix of pairwise distances is symmetric about the diagonal, you only need to compute its upper triangle. Moreover, once you have the matrix of pairwise distances, the actual clustering can be done in n-1 iterations. A fast way to compute the matrix of pairwise distances is to use parallel programming (say, the GPU); results have shown that computing pairwise distances on the GPU can be about 64 times faster than on the CPU. However, for a clustering algorithm like single-link hierarchical clustering, the actual clustering still has to be done on the CPU.
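A minimal NumPy sketch of the upper-triangle idea on the CPU (the dense term-frequency matrix layout is an assumption; the GPU and clustering steps are omitted):

    import numpy as np

    def upper_triangle_cosines(tf):
        # tf: an (n_docs, n_terms) term-frequency matrix.
        norms = np.linalg.norm(tf, axis=1)
        sims = {}
        for i in range(len(tf)):
            # Start j at i + 1 so each pair is computed exactly once;
            # the lower triangle is just the mirror image.
            for j in range(i + 1, len(tf)):
                sims[(i, j)] = float(tf[i] @ tf[j]) / (norms[i] * norms[j])
        return sims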

Upvotes: 0
