towi_parallelism

Reputation: 1471

Return best matches for the input query having several features in python

I have items as follows:

[
    { "id":"item1", "age": 1, "color": 'fff', "rate": 3 },
    { "id":"item2", "age": 2, "color": '000', "rate": 4 },
    { "id":"item3", "age": 3, "color": 'eee', "rate": 5 },
    { "id":"item4", "color": 'bbb', "rate": 5 }
]

Now, I expect user to search for a desirable item: {"age": 1, "color": '000', "rate":5} or even {"age": 3, "color": 'abc'}

I would like to find the best match(es) for this query. How do I do that? I'm not looking for an exact answer. Although, I'm interested to implement it as a backend service, so Python should be fine. I'm just not sure how to tackle the problem. Is there some sort of a matching algorithm or a fuzzy search I need to look at?

UPDATE: the data is large (millions of items), and each item has 50-100 keys, though some items might not have all of them. User queries might not contain all keys either.

Upvotes: 3

Views: 347

Answers (2)

Moses Koledoye

Reputation: 78546

How large is your dataset?

For a small dataset, you could do this in O(n*m) time (n items in the list, m keys per dict): compare values at matching keys, count the matches for each item, and keep the item with the most matches.

search_item = {"age": 1, "color": '000', "rate": 5}

match = None
mx = -1
for item in lst:
    # Compare only the keys the user searched for; item.get avoids a
    # KeyError when an item lacks one of those keys (e.g. item4 has no "age").
    curr = sum(item.get(k) == v for k, v in search_item.items())
    if curr > mx:
        match = item
        mx = curr
print(match)

The search criterion might not be simple key-value matching. You get to define that!

For a very large dataset, you could instead build a k-d tree from your list, cutting the average search time to O(log n) per query, provided you reuse the tree to search for multiple items.

You should convert the hex colors to a numeric type so that you have a homogeneous int type across dimensions and comparison is easier: comparing ints is much simpler than fuzzy string matching.

For example, the color ffb is closer to fff than eee is:

>>> int('fff', 16)
4095
>>> int('ffb', 16)
4091
>>> int('eee', 16)
3822
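To illustrate the approach, here is a minimal pure-Python k-d tree sketch (the names `build_kdtree` and `nearest` are illustrative, not from any library). It assumes every item has already been converted to a fixed-length numeric vector, e.g. via int(color, 16); in practice you would likely reach for scipy.spatial.KDTree instead, and normalize the dimensions so that one feature (like the large color values) does not dominate the distance.

```python
import math

def build_kdtree(points, depth=0):
    """points: list of (vector, payload) pairs; vectors are equal-length numeric tuples."""
    if not points:
        return None
    axis = depth % len(points[0][0])           # cycle through the dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2                     # median point becomes this node
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Return (distance, (vector, payload)) of the stored point closest to target."""
    if node is None:
        return best
    vec, _ = node["point"]
    d = math.dist(vec, target)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = target[node["axis"]] - vec[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    if abs(diff) < best[0]:                    # hypersphere crosses the splitting plane
        best = nearest(far, target, best)
    return best

items = [((1, int("fff", 16), 3), "item1"),
         ((2, int("000", 16), 4), "item2"),
         ((3, int("eee", 16), 5), "item3")]
tree = build_kdtree(items)
print(nearest(tree, (1, int("000", 16), 5)))  # closest stored item to the query vector
```

Note that without scaling, the color dimension (0-4095) swamps age and rate in the Euclidean distance; dividing each dimension by its range first is the usual fix.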

Upvotes: 1

timgeb

Reputation: 78690

I am assuming that you want the element of data that matches best, not the best match for each key across all the dicts.

This could get you started:

>>> data = [
...     { "id":"item1", "age": 1, "color": 'fff', "rate": 3 },
...     { "id":"item2", "age": 2, "color": '000', "rate": 4 },
...     { "id":"item3", "age": 3, "color": 'eee', "rate": 5 }
... ]
>>> user_input = {"age": 1, "color": 'fff', "rate":5}
>>>
>>> criterion = lambda d: len(user_input.items() & d.items())
>>> max(data, key=criterion)
{'id': 'item1', 'age': 1, 'color': 'fff', 'rate': 3}

The call to max returns the only element of data with two matches here.

If you want more sophisticated fuzzy matching than just counting direct hits, for example 'ffe' being closer to 'fff' than 'abc' then you need to

  • define a distance metric for each type of value associated with certain keys
  • use these metrics to implement a more sophisticated criterion.

For strings, consider the Levenshtein distance; for numeric types, abs(x - y).
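As a sketch of such a criterion (the `levenshtein` and `distance` helpers below are illustrative names, not part of any library), one could sum per-key distances over the keys the query and the item share, then take the item with the smallest total:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def distance(query, item):
    """Sum per-key distances over keys present in both dicts (lower is better)."""
    total = 0
    for k, v in query.items():
        if k in item:
            w = item[k]
            total += levenshtein(v, w) if isinstance(v, str) else abs(v - w)
    return total

data = [{"id": "item1", "age": 1, "color": "fff", "rate": 3},
        {"id": "item2", "age": 2, "color": "000", "rate": 4},
        {"id": "item3", "age": 3, "color": "eee", "rate": 5}]
user_input = {"age": 1, "color": "fff", "rate": 5}
print(min(data, key=lambda d: distance(user_input, d)))  # item1 wins here
```

One design choice worth noting: keys missing from an item contribute nothing to the total here, which silently favors sparse items; you may want to add a fixed penalty per missing key instead.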

Upvotes: 2
