Reputation: 810
I have a dataset of 25 integer fields and 40k records, e.g.
1:
field1: 0
field2: 3
field3: 1
field4: 2
[...]
field25: 1
2:
field1: 2
field2: 1
field3: 4
field4: 0
[...]
field25: 2
etc.
I'm testing with MySQL but am not tied to it.
Given a single record, I need to retrieve the records most similar to it; something like the lowest average difference of the fields. I started looking at the following, but I don't know how to map this onto the problem of searching for similarities in a large dataset.
Upvotes: 2
Views: 1554
Reputation: 49
I know it's an old post, but for anyone who comes by it seeking similar algorithms, one that works particularly well is Cosine Similarity. Find a way to vectorize your records, then look for vectors with minimum angle between them. If vectorizing a record is not trivial, then you can vectorize similarity between them via some known algorithm, and then look at cosine similarity of the similarity vectors to the perfect match vector (assuming perfect matches aren't the goal since they're easy to find anyway). I get tremendous results with this matching even comparing things like lists of people in various countries working on a particular project with various contributions to the project. Vectorization implies looking at number of country matches, country mismatches, ratio of people in a matching country between two datasets, etc etc etc. I use string edit distance functions like Levenshtein distance for getting numeric value from string dissimilarities, but one could use phonetic matching, etc. As long as the target number is not 0 (vector [0 0 ... 0] is the subspace of ANY vector and thus its angle would be undefined. Sometimes to get away from the problem, such as the case of edit distance, I give a perfect match (e.d. 0) a negative weight, so that perfect matches are really emphasized. -1 and 1 are farther away than 1 and 2, which makes a lot of sense - perfect match is better than anything with even 1 misspelling.
Cos(theta) = (A dot B) / (Norm(A)*Norm(B)) where dot is the dot-product, and Norm is the Euclidian magnitude of the vector.
Good luck!
Upvotes: 4
Reputation: 810
Here's a possibility with straight average distance between each of the fields (the value after each minus is from the given record needing a match):
SELECT id,
(
ABS(field1-2)
+ ABS(field2-2)
+ ABS(field3-3)
+ ABS(field4-1)
+ ABS(field5-0)
+ ABS(field6-3)
+ ABS(field7-2)
+ ABS(field8-0)
+ ABS(field9-1)
+ ABS(field10-0)
+ ABS(field11-2)
+ ABS(field12-2)
+ ABS(field13-3)
+ ABS(field14-2)
+ ABS(field15-0)
+ ABS(field16-1)
+ ABS(field17-0)
+ ABS(field18-2)
+ ABS(field19-3)
+ ABS(field20-1)
+ ABS(field21-0)
+ ABS(field22-1)
+ ABS(field23-3)
+ ABS(field24-2)
+ ABS(field25-2)
)/25
AS distance
FROM mytable
ORDER BY distance ASC
LIMIT 20;
Upvotes: 0