Calculating similarity between and centroid of Lucene documents

To run a simple clustering algorithm on the results I get from Lucene, I need to calculate the cosine similarity between two Lucene documents. I also need to be able to build a centroid document to represent each cluster.

All I can think of is building my own vector space model with tf-idf weighting, using the TermFreqVectors and the overall term frequencies to populate it.
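For reference, here is a minimal sketch of what such a hand-built model might look like. It is plain Java with no Lucene calls: it assumes the per-document term counts have already been read out of the index into maps, and the class name `TfIdfSketch`, its method names, and the idf variant `ln(N / df)` are illustrative choices of mine, not Lucene's own scoring formula.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class TfIdfSketch {

    // Turns raw term counts (one map per document) into sparse tf-idf vectors.
    // Assumption: weight = tf * ln(N / df), one common tf-idf variant.
    static List<Map<String, Double>> tfIdf(List<Map<String, Integer>> termCounts) {
        int n = termCounts.size();

        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> docFreq = new HashMap<>();
        for (Map<String, Integer> doc : termCounts) {
            for (String term : doc.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> doc : termCounts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                double idf = Math.log((double) n / docFreq.get(e.getKey()));
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Cosine similarity of two sparse vectors; returns 0 if either is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller map so the dot product touches fewer entries.
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = small == a ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Note that with this idf variant a term that occurs in every document gets weight 0, so it does not contribute to the similarity at all.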

My question is: this does not seem like an efficient approach. Is there a better way to do this?

This feels a little unclear so any suggestions on how I can improve my question are also appreciated.

Upvotes: 2

Answers (3)

ikel

Reputation: 1978

To get the similarity of one document to another, why not build a query from the content of one document and run it against the index? That way, you will get a score (cosine similarity values) for each hit.

Upvotes: 0

Mark

Reputation: 312

The short answer is: No.

I have spent a lot of time (far too much) looking into this, and as far as I can see, you can either build your own vector space model and work from that, or use Mahout to generate Mahout Vectors, which you can then use to compare documents. I am going to go ahead and build my own, so I'm marking this question as answered!
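For the centroid part of a hand-built model, one plausible approach is to average the sparse vectors of a cluster's members term by term. The sketch below assumes each document is already a map from term to weight; `CentroidSketch` and `centroid` are hypothetical names, not Lucene or Mahout API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Mahout.
final class CentroidSketch {

    // Component-wise mean of sparse vectors; a term missing from a member counts as 0.
    static Map<String, Double> centroid(List<Map<String, Double>> members) {
        Map<String, Double> result = new HashMap<>();
        if (members.isEmpty()) {
            return result; // no members: empty centroid
        }
        // Sum the weights of each term over all members.
        for (Map<String, Double> vec : members) {
            for (Map.Entry<String, Double> e : vec.entrySet()) {
                result.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        // Divide every summed weight by the number of members.
        int n = members.size();
        result.replaceAll((term, total) -> total / n);
        return result;
    }
}
```

The centroid has the same shape as a document vector, so it can be compared against documents with the same cosine measure used between documents.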

Upvotes: 0

Yuval F

Reputation: 20621

Mark, you may find "Integrating Mahout with Lucene", "IR Math with Java", or "Vector Space Classifier Using Lucene" useful.

Upvotes: 1
