jodeci

Reputation: 996

Measuring similarity between document sets

For illustration purposes, let's assume this is a forum service. I need to calculate the "similarity" among each user's posts, so that the result would be something like:

among posts by user A, similarity 60%
among posts by user B, similarity 20%
...

I'm dealing with multibyte strings, so I guess I'm stuck with search engines here. We already use Solr, already have moreLikeThis implemented, but I'm not quite sure how to construct the query. Any help appreciated!

Upvotes: 7

Views: 1903

Answers (3)

Mikos

Reputation: 8563

There are several measures of similarity; a simple and effective one is cosine similarity. There are also more sophisticated ones, such as Smith-Waterman.

Look at http://sourceforge.net/projects/simmetrics/
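To make the cosine-similarity suggestion concrete, here is a minimal sketch in plain Python, not tied to Solr or SimMetrics. It assumes each post is represented by counts of character bigrams (an assumption chosen because it avoids word segmentation for multibyte text) and that a user's score is the mean similarity over all pairs of their posts. The function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams; works on any Unicode string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty or too-short post has no n-grams
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(posts, n=2):
    """Average cosine similarity over all pairs of posts by one user."""
    vectors = [char_ngrams(p, n) for p in posts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two posts: nothing to compare
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Calling `mean_pairwise_similarity(posts_by_user_a)` gives a value between 0 and 1 that can be shown as a percentage. The number of pairs grows quadratically with the number of posts, so for prolific users you would probably want to sample pairs or compare each post against a per-user centroid instead.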

Upvotes: 0

Omnaest

Reputation: 3096

Possibly Carrot2 will interest you (and this blog related to it)

Upvotes: 1

D_K

Reputation: 1420

This is a strange question in two ways: 1. Why do you have to use Solr? 2. The right kind of similarity depends on the target problem. Your question sounds too generic to me. There is ongoing research in the area of semantic similarity. There is also the edit-distance algorithm, which is probably not what you want.

So define your question more precisely and you will get better answers.
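For reference, the edit distance mentioned above can be sketched as follows. This is the standard Levenshtein dynamic-programming formulation, written here only for illustration; it compares strings character by character, which is why it suits near-duplicate detection better than topical similarity.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

Python strings are sequences of Unicode code points, so this works unchanged on multibyte text.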

Upvotes: 0
