user7174053

Reputation:

How to sort by best match?

I have a query where I want results to match as close to the conditions as possible.

For example, if I have a condition "a" = 500, the returned results should be sorted so that I get 500, 499, 501, 498, 502 and so on, expanding from the provided value to both sides (positive and negative) like a tree.

This is easy with something like:

select val, abs(500 - val) as num from foo order by num asc, val asc limit 10;

Now what I want to achieve is to apply multiple such conditions (sorts). That by itself is not an issue, BUT what I want in the end is the best match against ALL provided conditions.

Just adding these sorts would mean that the results get sorted by the first field, then, where there are duplicates on that value, by the second field, and so on. In other words, the first sort would mostly dictate the order.

What I have in mind is to give each of these sorts a "weight" and to sort the result by all the weights computed together.

For example, suppose one record differs from the first target by 2 (I'm looking for 500 and the record has 498) and from the second target by 100 (I'm looking for 200 and the record has 100). A second record differs from the first target by 1 (I'm looking for 500 and the record has 501) and from the second target by 105 (I'm looking for 200 and the record has 305). The second record would come first according to the first sort (since 1 is less than 2). But the first record, although it is off by 2 on the first sort, is off by only 100 on the second sort, compared to 105 for the second record. So the first record actually matches the criteria better than the second record.

Therefore simply adding the differences together is not a good approach (since each sort, and each difference within it, carries a different weight). So I wonder what a correct solution would be?

This was quite hard to explain in words, so if it is still not clear let me know and I'll try to explain it differently.

EDIT: Just to be clear, there is no standard unit for the values. They use different units, numbers and scales, hence the weighting I mentioned. I think a percentage has to come in somewhere, something like:

select val, valB, ((abs(500 - val) / (500 / 100)) + (abs(200 - valB) / (200 / 100))) as rank from foo order by rank asc;
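To make the percentage idea concrete, here is a minimal sketch that runs it against the two example records, using Python's built-in sqlite3 module only so that the query is executable. The sample rows, the function name rank_by_percent and the alias score are assumptions for illustration. It multiplies by 100.0 so the division is not truncated on integer columns, and it assumes the targets are non-zero.

```python
import sqlite3

# Hypothetical data: the two records from the example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (val REAL, valB REAL)")
conn.executemany("INSERT INTO foo VALUES (?, ?)", [(501, 305), (498, 100)])

def rank_by_percent(conn, target_a, target_b):
    """Order rows by the sum of percentage deviations from two non-zero targets."""
    sql = """
        SELECT val, valB,
               abs(? - val) * 100.0 / ? + abs(? - valB) * 100.0 / ? AS score
        FROM foo
        ORDER BY score ASC
    """
    params = (target_a, target_a, target_b, target_b)
    return conn.execute(sql, params).fetchall()

rows = rank_by_percent(conn, 500, 200)
for val, val_b, score in rows:
    print(val, val_b, round(score, 1))
```

With targets 500 and 200, the record (498, 100) scores 0.4 + 50 = 50.4 and the record (501, 305) scores 0.2 + 52.5 = 52.7, so the first record is ranked ahead, which is the outcome described in the example.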

Upvotes: 0

Views: 935

Answers (2)

Sergey Kalinichenko

Reputation: 726689

Consider each record a point in K-dimensional space, with attributes {val1, val2, ..., valK}. Consider the desired combination of values another point, with attributes {search1, search2, ..., searchK}.

One approach to sorting the points is by their distance to the search point, i.e. square root of the sum of squared pairwise differences:

ORDER BY 
    POW(val1-search1, 2)
+   POW(val2-search2, 2)
+   ...
+   POW(valK-searchK, 2)

This is the formula for squared Euclidean distance in K dimensions. We do not need to take square root, because we use the distance only for ordering, while the actual value is discarded.
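As a rough, runnable illustration of this ordering, the sketch below uses Python's built-in sqlite3 module with two dimensions. The table contents and the function name nearest are made up for the example. It writes each square as a product instead of calling POW, because math functions are not available in every SQLite build.

```python
import sqlite3

# Hypothetical table and values, chosen only to illustrate the ordering.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (val1 REAL, val2 REAL)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(498, 100), (501, 305), (500, 200), (450, 210)],
)

def nearest(conn, search1, search2, limit=10):
    """Return rows ordered by squared Euclidean distance to the search point."""
    sql = """
        SELECT val1, val2
        FROM foo
        ORDER BY (val1 - ?) * (val1 - ?) + (val2 - ?) * (val2 - ?)
        LIMIT ?
    """
    params = (search1, search1, search2, search2, limit)
    return conn.execute(sql, params).fetchall()

rows = nearest(conn, 500, 200)
print(rows)
```

For the search point (500, 200) the squared distances are 0, 2600, 10004 and 11026, so the rows come back as (500, 200), (450, 210), (498, 100), (501, 305).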

If one field is in meters and another field is in kilometers (or a currency, or liters, or any other unit), then this would not work as is.

You will need to "homogenize" your space by introducing weights. For measures of the same kind, say, meters and kilometers, this is done by setting the weight for meters at 10^-6, or setting the weight for kilometers at 10^6.

For measures of different kind, e.g. meters and currencies, you would need to decide how much worth you wish to assign to each meter, and use the square of the coefficient as the corresponding weight.
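Here is a small sketch of the weighted version, again with Python's sqlite3 module. The columns dist_m and dist_km, the sample rows and the function nearest_weighted are assumptions for illustration. Each weight multiplies the squared difference of its column.

```python
import sqlite3

# Hypothetical table: one column in meters, one in kilometers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (dist_m REAL, dist_km REAL)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1000, 2.0), (1900, 1.0), (1000, 1.5)],
)

def nearest_weighted(conn, search, weights):
    """Order rows by weighted squared distance to the search point."""
    (s1, s2), (w1, w2) = search, weights
    sql = """
        SELECT dist_m, dist_km
        FROM foo
        ORDER BY ? * (dist_m - ?) * (dist_m - ?)
               + ? * (dist_km - ?) * (dist_km - ?)
    """
    return conn.execute(sql, (w1, s1, s1, w2, s2, s2)).fetchall()

# Search point: 1000 m in the first column, 1.0 km in the second.
unweighted = nearest_weighted(conn, (1000, 1.0), (1, 1))
weighted = nearest_weighted(conn, (1000, 1.0), (1, 1e6))
print(unweighted)
print(weighted)
```

Without weights, the row that is a full kilometer off in the second column, (1000, 2.0), sorts ahead of (1900, 1.0), which is only 900 m off in the first column. With the kilometer column weighted by 10^6 the two swap places, as expected once both differences are measured in meters.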

Upvotes: 2

Gordon Linoff

Reputation: 1270051

Dasblinkenlight's solution uses standard Euclidean distance. There is a lot of work in statistics and mathematics on metrics suitable for such differences.

Another method is Manhattan distance. This is simply the sum of the absolute values of the differences:

order by (abs(val1 - search1) +
          abs(val2 - search2) +
          . . .
          abs(valk - searchk)
         )

Depending on the situation, a statistical measure such as chi-square or Pearson correlation might be appropriate.

In addition, both this and the Euclidean version assume that the different dimensions have similar scales. In practice, you might want to standardize the values (subtract the average and divide by the standard deviation), so all dimensions have similar ranges.
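A rough sketch of Manhattan distance on standardized values follows, using Python's sqlite3 and statistics modules. The sample rows and function names are assumptions for illustration. Since the record and the search point are standardized with the same average, the average cancels out of each difference and only the division by the standard deviation remains. The standard deviations are computed in Python here because SQLite has no built-in standard deviation aggregate.

```python
import sqlite3
import statistics

# Hypothetical table and values, chosen only to illustrate the ordering.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (val1 REAL, val2 REAL)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(498, 100), (501, 305), (450, 210), (520, 190)],
)

def column_stds(conn):
    """Population standard deviation of each column, computed in Python."""
    data = conn.execute("SELECT val1, val2 FROM foo").fetchall()
    return (statistics.pstdev(r[0] for r in data),
            statistics.pstdev(r[1] for r in data))

def nearest_standardized(conn, search1, search2):
    """Order rows by Manhattan distance measured in standard deviations."""
    sd1, sd2 = column_stds(conn)
    sql = """
        SELECT val1, val2
        FROM foo
        ORDER BY abs(val1 - ?) / ? + abs(val2 - ?) / ?
    """
    return conn.execute(sql, (search1, sd1, search2, sd2)).fetchall()

rows = nearest_standardized(conn, 500, 200)
print(rows)
```

With these rows, plain Manhattan distance would place (450, 210) second, because its raw differences add up to only 60. After scaling, its 50-unit miss on the narrow val1 column counts for much more and it drops to last place.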

Upvotes: 0
