Reputation: 319
I have a CSV file containing questionnaire data with 14000 rows. The questionnaire has multiple-response MCQs (M10, M13). For an MCQ-MR such as M13, which has 8 choices, each choice is denoted 1 if the respondent selected it and 0 otherwise. I would like to generate a similarity score for each bit string and replace the bit strings with it. The score should be calculated so that 00010011 and 00100011 count as more similar, since the respondent chose the same choices except for the third and fourth, and so their scores should be closer together than those of 00010011 and 00000001.
M10,M13
1111000100001000,00000001
101010000001000,00000001
111010000001000,00010011
110010000001100,00100011
This thread gives some insight into Levenshtein distance, which compares two strings. But for 14000 rows it would be a huge computational burden. Is there any other method to do it?
Upvotes: 0
Views: 326
Levenshtein edit distance isn't what you want here. It would consider A=101010 and B=010101 to be very similar, because you can turn A into B by adding a 0 at the beginning and removing the 0 at the end. You'd probably rather they be considered maximally different, though, because they differ at every position.
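To make the contrast concrete, here is a minimal sketch comparing the two measures on A and B. The helper names `levenshtein` and `mismatches` are illustrative, not from any library, and the edit distance is a plain textbook dynamic-programming version rather than an optimized one.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


a, b = "101010", "010101"
print(levenshtein(a, b))  # 2: one insertion plus one deletion
print(mismatches(a, b))   # 6: every position differs
```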
So what you want is simply the size of the symmetric difference of the bit strings, also known as the Hamming distance. Perform a bitwise XOR on the two bit strings and count the 1 bits in the result -- each one corresponds to a position where the two strings differ.
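A minimal sketch of the XOR-and-count idea in Python, assuming the bit strings are read from the CSV as text and that the two strings being compared have the same length. The function name `bit_difference` is made up for illustration.

```python
def bit_difference(a: str, b: str) -> int:
    """Count differing positions: XOR the two values, then count the 1 bits."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return bin(int(a, 2) ^ int(b, 2)).count("1")


print(bit_difference("00010011", "00100011"))  # 2
print(bit_difference("00010011", "00000001"))  # 2
print(bit_difference("101010", "010101"))      # 6
```

Each comparison is one integer XOR and a bit count, so it is cheap even for 14000 rows. Note that under this plain count, both pairs from the question come out at a distance of 2.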
Upvotes: 2