towi_parallelism

Reputation: 1471

Return best matches for the input query having several features in python

I have items as follows:

[
    { "id":"item1", "age": 1, "color": 'fff', "rate": 3 },
    { "id":"item2", "age": 2, "color": '000', "rate": 4 },
    { "id":"item3", "age": 3, "color": 'eee', "rate": 5 },
    { "id":"item4", "color": 'bbb', "rate": 5 }
]

Now, I expect user to search for a desirable item: {"age": 1, "color": '000', "rate":5} or even {"age": 3, "color": 'abc'}

I would like to find the best match(es) for this query. How do I do that? I'm not looking for an exact answer. Although, I'm interested to implement it as a backend service, so Python should be fine. I'm just not sure how to tackle the problem. Is there some sort of a matching algorithm or a fuzzy search I need to look at?

UPDATE: the data is large (millions of items), and each item has 50-100 keys, though some items might not have all of them. User queries might not contain all keys either.

Upvotes: 3

Views: 347

Answers (2)

Moses Koledoye

Reputation: 78546

How large is your dataset?

For a small dataset, you could do this in O(n*m) time (n items in the list, m keys per dict): compare values at matching keys, count the matches for each item, and keep the item with the most matches.

search_item = {"age": 1, "color": '000', "rate": 5}

match = None
mx = -1
for item in lst:
    # Compare only the keys the user searched for; item.get avoids a
    # KeyError when an item lacks one of those keys (e.g. item4 has no "age").
    curr = sum(item.get(k) == v for k, v in search_item.items())
    if curr > mx:
        match = item
        mx = curr
print(match)

The search criterion might not be simple key-value matching. You get to define that!

For a very large dataset, you could instead build a k-d tree from your list, cutting the average search time to O(log n) per query, provided you reuse the tree to search for multiple items.

You should convert the hex colors to a numeric type so that you have a homogeneous int type across dimensions and comparison is easier: comparing ints is much simpler than fuzzy string matching.

For example, the color ffb is closer to fff than eee is:

>>> int('fff', 16)
4095
>>> int('ffb', 16)
4091
>>> int('eee', 16)
3822
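To illustrate the approach, here is a minimal pure-Python k-d tree sketch (the names `build_kdtree` and `nearest` are illustrative, not from any library). It assumes every item has already been converted to a fixed-length numeric vector, e.g. via int(color, 16); in practice you would likely reach for scipy.spatial.KDTree instead, and normalize the dimensions so that one feature (like the large color values) does not dominate the distance.

```python
import math

def build_kdtree(points, depth=0):
    """points: list of (vector, payload) pairs; vectors are equal-length numeric tuples."""
    if not points:
        return None
    axis = depth % len(points[0][0])           # cycle through the dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2                     # median point becomes this node
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Return (distance, (vector, payload)) of the stored point closest to target."""
    if node is None:
        return best
    vec, _ = node["point"]
    d = math.dist(vec, target)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = target[node["axis"]] - vec[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    if abs(diff) < best[0]:                    # hypersphere crosses the splitting plane
        best = nearest(far, target, best)
    return best

items = [((1, int("fff", 16), 3), "item1"),
         ((2, int("000", 16), 4), "item2"),
         ((3, int("eee", 16), 5), "item3")]
tree = build_kdtree(items)
print(nearest(tree, (1, int("000", 16), 5)))  # closest stored item to the query vector
```

Note that without scaling, the color dimension (0-4095) swamps age and rate in the Euclidean distance; dividing each dimension by its range first is the usual fix.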

Upvotes: 1

timgeb

Reputation: 78690

I am assuming that you want the element of data that matches best, not the best match for each key across all the dicts.

This could get you started:

>>> data = [
...     { "id":"item1", "age": 1, "color": 'fff', "rate": 3 },
...     { "id":"item2", "age": 2, "color": '000', "rate": 4 },
...     { "id":"item3", "age": 3, "color": 'eee', "rate": 5 }
... ]
>>> user_input = {"age": 1, "color": 'fff', "rate":5}
>>>
>>> criterion = lambda d: len(user_input.items() & d.items())
>>> max(data, key=criterion)
{'id': 'item1', 'age': 1, 'color': 'fff', 'rate': 3}

The call to max returns the only element of data with two matches here.

If you want more sophisticated fuzzy matching than just counting direct hits, for example 'ffe' being closer to 'fff' than 'abc' then you need to

  • define a distance metric for each type of value associated with certain keys
  • use these metrics to implement a more sophisticated criterion.

For strings, consider the Levenshtein distance; for numeric types, abs(x - y).
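As a sketch of such a criterion (the `levenshtein` and `distance` helpers below are illustrative names, not part of any library), one could sum per-key distances over the keys the query and the item share, then take the item with the smallest total:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def distance(query, item):
    """Sum per-key distances over keys present in both dicts (lower is better)."""
    total = 0
    for k, v in query.items():
        if k in item:
            w = item[k]
            total += levenshtein(v, w) if isinstance(v, str) else abs(v - w)
    return total

data = [{"id": "item1", "age": 1, "color": "fff", "rate": 3},
        {"id": "item2", "age": 2, "color": "000", "rate": 4},
        {"id": "item3", "age": 3, "color": "eee", "rate": 5}]
user_input = {"age": 1, "color": "fff", "rate": 5}
print(min(data, key=lambda d: distance(user_input, d)))  # item1 wins here
```

One design choice worth noting: keys missing from an item contribute nothing to the total here, which silently favors sparse items; you may want to add a fixed penalty per missing key instead.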

Upvotes: 2
