Java library method or algorithm to estimate aggregate string similarity?

Question

I have responses from users to multiple choice questions, e.g. (roughly):

Married/Single
Male/Female
American/Latin American/European/Asian/African

What I want is to estimate similarity by aggregating all responses into a single field which can be compared across users in the database - rather than running queries against each column.

So, for example, some responses might look like:

Married-Female-American
Single-Female-European

But I don't want to store a massive text object to represent all of the possible concatenated responses since there are maybe 50 of them.

So, is there some way to represent a set of responses more concisely using a Java library method of some kind?

In other words, this method would take Married-Female-American and generate a code, say abc, while Single-Female-European would generate a different code, say def?

This way, if I want to find out whether two users are both Married-Female-American, I can simply query a single column for the code abc.

alf · Accepted Answer

Well, if these are multiple choice questions, the choices are already enumerated. That is, numbered. Why not store 1-1-2 and 23-1-75 then? Even with 50 answers, that is still manageable.
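A minimal sketch of that idea, assuming each answer is stored as the 1-based number of the chosen option. The class name AnswerCode and the numbering in the comments (Married=1, Single=2, and so on) are made up for illustration; use whatever numbering your questionnaire already has.

```java
import java.util.StringJoiner;

final class AnswerCode {
    // Joins the chosen option number of each question into one compact key,
    // e.g. {1, 2, 1} becomes "1-2-1".
    static String encode(int[] choiceIds) {
        StringJoiner joiner = new StringJoiner("-");
        for (int id : choiceIds) {
            joiner.add(Integer.toString(id));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Hypothetical numbering: Married=1, Single=2; Male=1, Female=2;
        // American=1, Latin American=2, European=3, Asian=4, African=5.
        System.out.println(encode(new int[] {1, 2, 1})); // Married-Female-American
        System.out.println(encode(new int[] {2, 2, 3})); // Single-Female-European
    }
}
```

Two users with identical answers get identical keys, so an equality query on that one column finds exact matches. It says nothing about how close two different keys are.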

Now if you happen to need the similarity, aggregating is the last thing you want. What you want is a simple array of ids of the answers given and a function defining a distance between two answer arrays. Do not use Strings, do not aggregate. Leave clean nice vectors, and all the ML libraries will be at your service.
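One possible distance function, as a sketch: count the questions on which two users answered differently (Hamming distance). This is my choice for illustration, not the only sensible one; the class and method names are hypothetical, and it assumes both arrays list the same questions in the same order.

```java
final class AnswerDistance {
    // Number of questions where the two users picked different options.
    static int hamming(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("answer arrays must have the same length");
        }
        int differing = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                differing++;
            }
        }
        return differing;
    }

    // Fraction of questions answered identically, between 0.0 and 1.0.
    static double similarity(int[] a, int[] b) {
        if (a.length == 0) {
            return 1.0; // no questions, so nothing differs
        }
        return 1.0 - (double) hamming(a, b) / a.length;
    }

    public static void main(String[] args) {
        int[] first = {1, 2, 1};  // hypothetical ids for Married-Female-American
        int[] second = {2, 2, 3}; // hypothetical ids for Single-Female-European
        System.out.println(hamming(first, second));
        System.out.println(similarity(first, second));
    }
}
```

Because the answers are categorical, the numeric gap between two ids means nothing, which is why this only checks whether they are equal.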

To name one Java ML library, try Weka: http://www.cs.waikato.ac.nz/~ml/weka/

Update: One more thing you may want to try is locality sensitive hashing. I don't think it's a good idea in your case, but your question looks like a request for it. Give it a try.
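For the curious, here is a rough sketch of one simple LSH scheme for this kind of data: sample a few question positions at random and use the answers at those positions as the hash. Users who agree on most questions are more likely to agree on the sampled ones, so they tend to land in the same bucket. The class name, the seed, and the sample size are all assumptions for illustration; a real setup would use several such hashes and tune them.

```java
import java.util.Random;

final class SampledSignature {
    private final int[] positions;

    // Picks sampleSize question indices at random (with replacement).
    // The seed is fixed so that every user is hashed with the same positions.
    SampledSignature(int questionCount, int sampleSize, long seed) {
        Random random = new Random(seed);
        positions = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            positions[i] = random.nextInt(questionCount);
        }
    }

    // Builds a short key from the answers at the sampled positions only.
    String signature(int[] answers) {
        StringBuilder key = new StringBuilder();
        for (int position : positions) {
            if (key.length() > 0) {
                key.append('-');
            }
            key.append(answers[position]);
        }
        return key.toString();
    }

    public static void main(String[] args) {
        SampledSignature hash = new SampledSignature(3, 2, 42L);
        System.out.println(hash.signature(new int[] {1, 2, 1}));
        System.out.println(hash.signature(new int[] {2, 2, 3}));
    }
}
```

Equal signatures only suggest that two users are similar; you would still compare the full answer arrays of the candidates to confirm.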
