Reputation: 109
I have responses from users to multiple choice questions, e.g. (roughly):
Married/Single
Male/Female
American/Latin American/European/Asian/African
What I want is to estimate similarity by aggregating all responses into a single field which can be compared across users in the database - rather than running queries against each column.
So, for example, some responses might look like:
Married-Female-American
Single-Female-European
But I don't want to store a massive text object to represent all of the possible concatenated responses since there are maybe 50 of them.
So, is there some way to represent a set of responses more concisely using a Java library method of some kind?
In other words, this method would take Married-Female-American and generate a code, say abc, while Single-Female-European would generate a code of, say, def.
This way, if I want to find out if two users are Married-Female-Americans, I can simply query a single column for the code abc.
Upvotes: 2
Views: 442
Reputation: 77474
Do you have a finite number of options (multiple-choice seems to imply this)?
It is a common technique for performance to go from strings to a numerical data set, by essentially indexing the available strings. As long as you only need identity, this is perfect. Comparing an integer is much faster than comparing a string, and they usually take less memory, too.
A single-byte character is essentially an integer in 0-255 (a Java char is 16 bits wide, so it holds even more), so you can of course use this.
So just define an alphabet:
a Married
b Single
c Male
d Female
e American
f Latin American
g European
h Asian
i African
You can in fact use this even when you have more than 256 words, if they are positional (and no single question has more than 256 choices). You would then use
a Q1: Married
b Q1: Single
a Q2: Male
b Q2: Female
a Q3: American
b Q3: Latin American
c Q3: European
d Q3: Asian
e Q3: African
Your examples would then be encoded as either (variant 1) ade and bdg or (variant 2) aba and bbc. The string should then have a fixed length of 50 (if you have 50 questions) and can be stored very effectively.
For comparing answers, just access the nth character of the string. Maybe your database allows for indexed substring queries, too. As you can see in the above example, both strings agree only on the second character, just as the answers agree only on the second question.
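A minimal Java sketch of the positional encoding (variant 2) and the per-position comparison described above. The class name AnswerCodec, the QUESTIONS table, and both method names are hypothetical, chosen only for illustration; List.of assumes Java 9 or later.

```java
import java.util.List;

final class AnswerCodec {
    // Hypothetical question table: one inner list per question,
    // with the choices kept in a fixed order.
    static final List<List<String>> QUESTIONS = List.of(
        List.of("Married", "Single"),
        List.of("Male", "Female"),
        List.of("American", "Latin American", "European", "Asian", "African"));

    /** Encodes one answer per question as one character per position ('a' = first choice). */
    static String encode(String... answers) {
        if (answers.length != QUESTIONS.size()) {
            throw new IllegalArgumentException("expected " + QUESTIONS.size() + " answers");
        }
        StringBuilder code = new StringBuilder(answers.length);
        for (int q = 0; q < answers.length; q++) {
            int choice = QUESTIONS.get(q).indexOf(answers[q]);
            if (choice < 0) {
                throw new IllegalArgumentException("unknown answer: " + answers[q]);
            }
            code.append((char) ('a' + choice));
        }
        return code.toString();
    }

    /** Counts the positions (questions) on which two equal-length codes agree. */
    static int agreements(String x, String y) {
        if (x.length() != y.length()) {
            throw new IllegalArgumentException("codes must have the same length");
        }
        int same = 0;
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) == y.charAt(i)) {
                same++;
            }
        }
        return same;
    }

    public static void main(String[] args) {
        String first = encode("Married", "Female", "American");  // "aba"
        String second = encode("Single", "Female", "European");  // "bbc"
        System.out.println(first + " " + second + " agree on " + agreements(first, second));
    }
}
```

Because every code has the same length, an exact match on the whole string answers the "same on everything" query, while agreements gives a rough similarity count.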
Upvotes: 1
Reputation: 8513
Well, if it was a multiple choice question, you have choices enumerated. That is, numbered. Why not use 1-1-2 and 23-1-75 then? Even if you have 50 answers, it's still manageable.
Now if you happen to need the similarity, aggregating is the last thing you want. What you want is a simple array of ids of the answers given and a function defining a distance between two answer arrays. Do not use Strings, do not aggregate. Leave clean nice vectors, and all the ML libraries will be at your service.
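As a rough sketch of "an array of ids plus a distance function", the Java below uses the Hamming distance (the number of questions answered differently). That choice of distance is an assumption made for illustration, not something this answer prescribes, and the class and method names are hypothetical.

```java
final class AnswerDistance {
    /** Hamming distance: the number of questions on which the two users answered differently. */
    static int hamming(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("answer arrays must have the same length");
        }
        int differing = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                differing++;
            }
        }
        return differing;
    }

    /** Fraction of questions answered identically, in the range [0, 1]. */
    static double similarity(int[] a, int[] b) {
        if (a.length == 0 && b.length == 0) {
            return 1.0; // two empty answer sets are treated as identical
        }
        return 1.0 - (double) hamming(a, b) / a.length;
    }
}
```

For the two example vectors, hamming(new int[] {1, 1, 2}, new int[] {23, 1, 75}) counts two differing positions out of three.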
To name one Java ML library, try http://www.cs.waikato.ac.nz/~ml/weka/
Update: One more thing you may want to try is locality sensitive hashing. I don't think it's a good idea in your case, but your question looks like a request for it. Give it a try.
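For orientation only, here is a small Java sketch of one locality sensitive hashing scheme, bit sampling for Hamming distance: a fixed random subset of question positions is sampled once, and the answers at those positions form the bucket key. The answer does not name a specific scheme, so this choice, the class name BitSamplingLsh, and its parameters are all assumptions.

```java
import java.util.Arrays;
import java.util.Random;

final class BitSamplingLsh {
    // Question indices sampled once (with replacement) and shared by all users.
    private final int[] sampled;

    BitSamplingLsh(int numQuestions, int k, long seed) {
        if (numQuestions <= 0 || k <= 0) {
            throw new IllegalArgumentException("numQuestions and k must be positive");
        }
        Random rnd = new Random(seed);
        sampled = new int[k];
        for (int i = 0; i < k; i++) {
            sampled[i] = rnd.nextInt(numQuestions);
        }
    }

    /** Bucket key: the answers at the sampled positions, rendered as a string. */
    String signature(int[] answers) {
        int[] picked = new int[sampled.length];
        for (int i = 0; i < sampled.length; i++) {
            picked[i] = answers[sampled[i]];
        }
        return Arrays.toString(picked);
    }
}
```

Users whose answer arrays differ in only a few positions are likely to share a signature, so the signature can serve as a single indexed column for finding candidate matches; candidates would still need an exact distance check afterwards.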
Upvotes: 6