Find similar records in dataset

Question

I have a dataset of 25 integer fields and 40k records, e.g.

1:
  field1: 0
  field2: 3
  field3: 1
  field4: 2
  [...]
  field25: 1
2:
  field1: 2
  field2: 1
  field3: 4
  field4: 0
  [...]
  field25: 2

etc.

I'm testing with MySQL but am not tied to it.

Given a single record, I need to retrieve the records most similar to it; something like the lowest average difference of the fields. I started looking at the following, but I don't know how to map this onto the problem of searching for similarities in a large dataset.

yitznewton · Accepted Answer

Here's a possibility with straight average distance between each of the fields (the value after each minus is from the given record needing a match):

SELECT id,
(
  ABS(field1-2)
  + ABS(field2-2)
  + ABS(field3-3)
  + ABS(field4-1)
  + ABS(field5-0)
  + ABS(field6-3)
  + ABS(field7-2)
  + ABS(field8-0)
  + ABS(field9-1)
  + ABS(field10-0)
  + ABS(field11-2)
  + ABS(field12-2)
  + ABS(field13-3)
  + ABS(field14-2)
  + ABS(field15-0)
  + ABS(field16-1)
  + ABS(field17-0)
  + ABS(field18-2)
  + ABS(field19-3)
  + ABS(field20-1)
  + ABS(field21-0)
  + ABS(field22-1)
  + ABS(field23-3)
  + ABS(field24-2)
  + ABS(field25-2)
)/25
AS distance 
FROM mytable
ORDER BY distance ASC
LIMIT 20;

Find similar records in dataset

Answers (2)

Related Questions