Reputation: 1
Problem description: I am trying to perform an efficient multi-attribute similarity search across millions of records in a database. However, my process requires an hierarchical order of criteria for each matching case.
For example, consider that my records look like this:
Item(name, number, user)
And I have a variety of matching criteria with hierarchical order:
name
and number
name
and user
I need to find an efficient way to match a user request with my data.
So far I have build a process using Python (because I am most familiar with), where I first load my entire database (only the tables I need though which are still a couple million rows) in a in-memory SQLite
DB. I chose to do that so I do not increase traffic in my main DB during this process.
Then, per user request, I iterate through each criterion in order to find a proper match. The main problem is that for each criterion I have to perform a READ operation on the in-memory DB in order to get the matches that account for the specific criterion (I try to utilize SQL as much as possible) and then perform any post-processing using Python if needed. Also, if a record is matched with any Criterion, we continue to the next one (Since criteria are in hierarchical order!)
Note, that I run each user request in parallel to be more efficient.
Thus, far I have managed to process 100 requests in ~1 Hour using this method but this is not as efficient as I would like it to be.
I also researched vector DBs and other tools (Pandas
, Polars
, DuckDB
) but I am not sure whether they can be used for my use-case.
Any tips on how to improve my process are welcome!
Upvotes: 0
Views: 72