Reputation: 125
Hello, I have a Redis database that contains facial embeddings of 100k+ people. All of these are stored in Redis as key-value pairs. For example:
{
"embedding:angelina" : [128-D vector of angelina],
"embedding:emma" : [128-D vector of emma],
"embedding:dicaprio" : [128-D vector of dicaprio]
}
Now I am trying to compare a target embedding with all of the embeddings in my dataset to find the best match. One way I am trying to do it is to first retrieve all keys matching the pattern embedding*, then iterate over those embeddings and compute the distance between each one and the target embedding. If the distance is less than the threshold, I append it to a list, and then choose the shortest distance from that list. I don't know, but I have a feeling that this is not best practice. I would be glad if someone could help me find a better approach.
Note: I know Elasticsearch is a great candidate for such tasks, but I need to stick with Redis for now.
Upvotes: 0
Views: 1091
Reputation: 184
Iterating over Redis keys by pattern is possible, but it's not a best practice. The Redis docs warn the following:
Warning: consider KEYS as a command that should only be used in production environments with extreme care. It may ruin performance when it is executed against large databases. This command is intended for debugging and special operations, such as changing your keyspace layout. Don't use KEYS in your regular application code. If you're looking for a way to find keys in a subset of your keyspace, consider using SCAN or sets.
Using SCAN will protect the Redis instance's resources, but it will still take you a long time and many requests to use SCAN to get all the keys in a large dataset.
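As a rough illustration of the SCAN route combined with the threshold comparison described in the question, here is a minimal sketch. It assumes each embedding is stored as a JSON-encoded list of floats and that Euclidean distance is the metric; neither is stated in the question. The function name best_match_via_scan and the batch_size parameter are made up for this example, and client is expected to be a redis-py client (something like redis.Redis(host="localhost")).

```python
import json
import math


def best_match_via_scan(client, target, threshold, batch_size=1000):
    """Return (key, distance) for the closest embedding under threshold, or None.

    Assumption: values are JSON-encoded lists of floats.
    """
    best = None

    def check(keys):
        nonlocal best
        # MGET fetches a whole batch of values in a single round trip.
        for key, raw in zip(keys, client.mget(keys)):
            if raw is None:
                # The key was deleted or expired between SCAN and MGET.
                continue
            dist = math.dist(target, json.loads(raw))
            if dist < threshold and (best is None or dist < best[1]):
                best = (key, dist)

    batch = []
    # scan_iter issues SCAN calls under the hood and yields keys one by one.
    for key in client.scan_iter(match="embedding:*", count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            check(batch)
            batch = []
    if batch:
        check(batch)
    return best
```

This keeps only the running best match instead of building a list of every candidate under the threshold, but it still touches every key, so the cost stays linear in the size of the dataset.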
Some workarounds come to mind, depending on your situation:
- [...] angelina in the value if it's important.
- Whenever you add or remove an embedding:.* key in the dataset, also use SADD and SREM to add or remove the key name in a Set (could be named e.g. embedding_sample_keys). I haven't tried this but it sounds pretty viable.
- Store the embeddings in a hash structure (possibly the key would be embedding_data). This has downsides like not being able to set a distinct TTL for each cache key. You can use HKEYS and HSCAN to access all the keys in a hash, which might be an improvement over scanning the entire dataset.
Upvotes: 1
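The Set and hash workarounds from the answer could look roughly like the sketch below. It is untested against a real server and rests on assumptions: embeddings are stored as JSON-encoded lists of floats, Euclidean distance is the metric, and client is a redis-py client. The key names embedding_sample_keys and embedding_data come from the answer; the function names are hypothetical.

```python
import json
import math

INDEX_KEY = "embedding_sample_keys"  # Set name suggested in the answer
HASH_KEY = "embedding_data"          # hash key suggested in the answer


# Workaround: keep a Set of key names next to the plain keys.
def add_embedding(client, name, vector):
    key = f"embedding:{name}"
    # In real code these two writes could be wrapped in a pipeline
    # so that they are applied together.
    client.set(key, json.dumps(vector))
    client.sadd(INDEX_KEY, key)


def remove_embedding(client, name):
    key = f"embedding:{name}"
    client.delete(key)
    client.srem(INDEX_KEY, key)


def indexed_keys(client):
    # SSCAN walks only this Set, not the whole keyspace.
    return list(client.sscan_iter(INDEX_KEY))


# Workaround: keep every embedding as a field of one hash.
def add_embedding_to_hash(client, name, vector):
    client.hset(HASH_KEY, name, json.dumps(vector))


def best_match_in_hash(client, target, threshold):
    """Return (name, distance) for the closest embedding under threshold, or None."""
    best = None
    # HSCAN yields (field, value) pairs incrementally.
    for name, raw in client.hscan_iter(HASH_KEY):
        dist = math.dist(target, json.loads(raw))
        if dist < threshold and (best is None or dist < best[1]):
            best = (name, dist)
    return best
```

With the hash layout, HSCAN returns the values together with the field names, so no separate fetch per key is needed. Either way the comparison is still a linear pass over all stored embeddings.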