Milkmate

Reputation: 109

Java library method or algorithm to estimate aggregate string similarity?

I have responses from users to multiple choice questions, e.g. (roughly):

Married/Single
Male/Female
American/Latin American/European/Asian/African

What I want is to estimate similarity by aggregating all responses into a single field which can be compared across users in the database - rather than running queries against each column.

So, for example, some responses might look like:

Married-Female-American
Single-Female-European

But I don't want to store a massive text object to represent all of the possible concatenated responses since there are maybe 50 of them.

So, is there some way to represent a set of responses more concisely using a Java library method of some kind?

In other words, this method would take Married-Female-American and generate a code of, say, abc, while Single-Female-European would generate a code of, say, def.

This way if I want to find out if two users are Married-Female-Americans I can simply query a single column for the code abc.

Upvotes: 2

Views: 442

Answers (2)

Has QUIT--Anony-Mousse

Reputation: 77474

Do you have a finite number of options (multiple-choice seems to imply this)?

It is a common technique for performance to go from strings to a numerical data set, by essentially indexing the available strings. As long as you only need identity, this is perfect. Comparing an integer is much faster than comparing a string, and they usually take less memory, too.
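As a rough sketch of what "indexing the available strings" could look like in Java (the class and method names here are made up for illustration, not from any library), each distinct string gets the next free integer id the first time it is seen:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: assigns a small integer id to each distinct string.
class StringIndex {
    private final Map<String, Integer> ids = new HashMap<>();

    // Returns the id for s, assigning the next free id on first sight.
    int idOf(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            id = ids.size();
            ids.put(s, id);
        }
        return id;
    }

    public static void main(String[] args) {
        StringIndex index = new StringIndex();
        System.out.println(index.idOf("Married")); // first string seen gets id 0
        System.out.println(index.idOf("Single"));  // second distinct string gets id 1
        System.out.println(index.idOf("Married")); // same string, same id as before
    }
}
```

Two answers are then identical exactly when their ids are equal, which is all that is needed for identity checks.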

A single-byte character is essentially an integer in 0-255 (a Java char is 16 bits, so it has even more room), so you can of course use this.

So just define an alphabet:

a Married
b Single
c Male
d Female
e American
f Latin American
g European
h Asian
i African

You can in fact use this even when you have more than 256 words, if they are positional (and no single question has more than 256 choices). You would then use

a Q1: Married
b Q1: Single
a Q2: Male
b Q2: Female
a Q3: American
b Q3: Latin American
c Q3: European
d Q3: Asian
e Q3: African

Your examples would then be encoded as either (variant 1) ade and bdg or (variant 2) aba and bbc. The string then has a fixed length of 50 (if you have 50 questions) and can be stored very compactly.

For comparing answers, just access the nth character of the string. Maybe your database allows for indexed substring queries, too. As you can see in the example above, both strings agree only on the second character, just like the answers agreed.
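A minimal Java sketch of variant 2, assuming the three questions from your post (the class, the CHOICES list and both method names are hypothetical, and it needs Java 9+ for List.of). With more than 26 choices per question the codes simply run past 'z' into other characters, which is fine for comparison but less readable:

```java
import java.util.List;

// Hypothetical encoder for the positional scheme (variant 2).
class AnswerCode {
    // One list of choices per question, in a fixed question order.
    static final List<List<String>> CHOICES = List.of(
        List.of("Married", "Single"),
        List.of("Male", "Female"),
        List.of("American", "Latin American", "European", "Asian", "African"));

    // Encodes one answer per question as a fixed-length string:
    // the character at position q is 'a' plus the index of the chosen option.
    static String encode(List<String> answers) {
        if (answers.size() != CHOICES.size()) {
            throw new IllegalArgumentException("Expected " + CHOICES.size() + " answers");
        }
        StringBuilder code = new StringBuilder();
        for (int q = 0; q < CHOICES.size(); q++) {
            int idx = CHOICES.get(q).indexOf(answers.get(q));
            if (idx < 0) {
                throw new IllegalArgumentException("Unknown answer: " + answers.get(q));
            }
            code.append((char) ('a' + idx));
        }
        return code.toString();
    }

    // Counts the positions (questions) on which two codes agree.
    static int agreements(String x, String y) {
        if (x.length() != y.length()) {
            throw new IllegalArgumentException("Codes must have the same length");
        }
        int same = 0;
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) == y.charAt(i)) {
                same++;
            }
        }
        return same;
    }

    public static void main(String[] args) {
        String u1 = encode(List.of("Married", "Female", "American"));
        String u2 = encode(List.of("Single", "Female", "European"));
        // prints "aba bbc agree on 1"
        System.out.println(u1 + " " + u2 + " agree on " + agreements(u1, u2));
    }
}
```

Finding all Married-Female-Americans is then an equality query on the code column for aba.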

Upvotes: 1

alf

Reputation: 8513

Well, if it was a multiple choice question, you have choices enumerated. That is, numbered. Why not use 1-1-2 and 23-1-75 then? Even if you have 50 answers, it's still manageable.

Now if you happen to need the similarity, aggregating is the last thing you want. What you want is a simple array of ids of the answers given and a function defining a distance between two answer arrays. Do not use Strings, do not aggregate. Leave clean nice vectors, and all the ML libraries will be at your service.
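For instance, one simple choice of distance (my assumption, you may well want something else) is the Hamming distance: the number of questions on which two users answered differently. A sketch with made-up answer ids:

```java
// Hypothetical example: Hamming distance between two answer-id arrays.
class AnswerDistance {
    // Counts the questions on which the two users gave different answers.
    static int hamming(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                distance++;
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        int[] user1 = {1, 2, 1}; // e.g. Married, Female, American
        int[] user2 = {2, 2, 3}; // e.g. Single, Female, European
        System.out.println(hamming(user1, user2)); // prints 2
    }
}
```

A distance of 0 means identical answers; smaller values mean more similar users.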

To name one Java ML library, try Weka: http://www.cs.waikato.ac.nz/~ml/weka/

Update: One more thing you may want to try is locality sensitive hashing. I don't think it's a good idea in your case, but your question looks like a request for it. Give it a try.

Upvotes: 6
