Multi-attribute similarity search across millions or records based on criteria

Question

Problem description: I am trying to perform an efficient multi-attribute similarity search across millions of records in a database. However, my process requires an hierarchical order of criteria for each matching case.

For example, consider that my records look like this:

Item(name, number, user)

And I have a variety of matching criteria with hierarchical order:

Criterion A: Match with records with the same name and number
Criterion B: Match with records with similar name and user

I need to find an efficient way to match a user request with my data.

So far I have build a process using Python (because I am most familiar with), where I first load my entire database (only the tables I need though which are still a couple million rows) in a in-memory SQLite DB. I chose to do that so I do not increase traffic in my main DB during this process.

Then, per user request, I iterate through each criterion in order to find a proper match. The main problem is that for each criterion I have to perform a READ operation on the in-memory DB in order to get the matches that account for the specific criterion (I try to utilize SQL as much as possible) and then perform any post-processing using Python if needed. Also, if a record is matched with any Criterion, we continue to the next one (Since criteria are in hierarchical order!)

Note, that I run each user request in parallel to be more efficient.

Thus, far I have managed to process 100 requests in ~1 Hour using this method but this is not as efficient as I would like it to be.

I also researched vector DBs and other tools (Pandas, Polars, DuckDB) but I am not sure whether they can be used for my use-case.

Any tips on how to improve my process are welcome!

Multi-attribute similarity search across millions or records based on criteria

Answers (0)

Related Questions