Reputation: 6527
I have a store of 2 GB of hashes that I want to check through a public API.
Let's say I want to create an API that checks whether a person is known to my product. To respect the person's privacy, I don't want to upload their name, member ID, and so on, so I decided to upload only a hash of the combined information that identifies them. Now I have 2 GB (6*10^7) of SHA-256 hashes and want to check them as fast as possible.
This API should be hosted in Azure.
After reading the Azure storage account documentation, I think Azure Table Storage is the right storage solution. I would set the Base64-encoded hash as the partition key and leave the row key empty.
What is the fastest way to check if a partition key is present? I think my naive first try is not really the best way.
if (members.Where(x => x.PartitionKey == Convert.ToBase64String(data.Hash)).AsEnumerable().Any())
{
    return req.CreateResponse(HttpStatusCode.OK, "Found Hash");
}
else
{
    return req.CreateResponse(HttpStatusCode.NotFound, "Don't found Hash");
}
How should I upload the 2 GB of hashes? I'm thinking of uploading one big file and using an Azure Function to split it every 256 bits and add each value to Azure Storage. Or is there a better idea?
Upvotes: 1
Views: 1485
Reputation: 3169
I saw this is tagged with Azure-Functions, so I'll add that Azure-Functions lets you directly bind to table storage. See https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-table
You can even bind directly to a specific entity. The function.json would look like:
{
  "name": "<Name of input parameter in function signature>",
  "type": "table",
  "direction": "in",
  "tableName": "<Name of Storage table>",
  "partitionKey": "<PartitionKey of table entity to read - see below>",
  "rowKey": "<RowKey of table entity to read - see below>"
}
Upvotes: 0
Reputation: 35134
My take on this:
If the only query you need is "check whether a hash exists" (and retrieve its details if needed), then Table Storage is the perfect match. Key lookups are fast and cheap, and 2 GB is nothing.
The hash gives the most diversity, so I would use it as the partition key. The row key can then be anything. If Upload Id is never used for (range) lookups, don't use it for keys.
With proper partition key, the lookup time should be constant.
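One caveat on the key format: Table Storage does not allow the '/' character in PartitionKey or RowKey, and standard Base64 output can contain it. Below is a minimal sketch of building a hex-encoded key instead. The class and method names (`HashKeys`, `ToHexKey`, `FromIdentity`) are made up for illustration, and hashing a UTF-8 string is only an assumption about how the identifying fields are combined.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class HashKeys
{
    // Hex output uses only 0-9 and A-F, so it can never contain
    // '/', '\', '#' or '?', which Table Storage rejects in keys.
    public static string ToHexKey(byte[] hash)
    {
        // BitConverter.ToString yields "AB-CD-..."; drop the dashes.
        return BitConverter.ToString(hash).Replace("-", "");
    }

    // Hypothetical helper: hashes the combined identifying fields
    // (assumed here to be a single UTF-8 string) into a table key.
    public static string FromIdentity(string identity)
    {
        using (var sha = SHA256.Create())
        {
            return ToHexKey(sha.ComputeHash(Encoding.UTF8.GetBytes(identity)));
        }
    }
}
```

A SHA-256 hash always becomes a 64-character key this way, compared with 44 characters for Base64, which is still far below the key size limit.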
If you mean you need to check if user hash is there or not, just retrieve one row by partition key + row key. That's the fastest operation possible. See "Retrieve a single entity" here.
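For reference, a point lookup could look roughly like this with the current `Azure.Data.Tables` package (an assumption on my side; the older `WindowsAzure.Storage` SDK does the same thing with `TableOperation.Retrieve`). `HashLookup`, `ExistsAsync`, and the fixed row key `"0"` are hypothetical names and conventions chosen for the sketch.

```csharp
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class HashLookup
{
    // Assumed convention: every entity is stored with the same fixed RowKey,
    // because the hash in PartitionKey already identifies it.
    private const string FixedRowKey = "0";

    // Returns true if an entity with this hash key exists.
    public static async Task<bool> ExistsAsync(TableClient table, string hashKey)
    {
        // Point query on PartitionKey + RowKey; a missing entity is reported
        // through HasValue instead of an exception.
        var result = await table.GetEntityIfExistsAsync<TableEntity>(hashKey, FixedRowKey);
        return result.HasValue;
    }
}
```

You would create the client once, for example `new TableClient(connectionString, "members")`, and reuse it across requests. Compared to the `Where(...).Any()` query in the question, this names both keys, so the service can do a direct key lookup.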
Table Storage supports batch inserts. Again, 2GB is not much, you probably spent more time asking this question than your upload will take :)
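One thing to keep in mind for the upload: a batch accepts at most 100 entities, and all of them must share the same PartitionKey. With one hash per partition, each insert is therefore its own request (they can still run in parallel). If you want real batches, one variation is to use a short prefix of the hash as PartitionKey and the full hash as RowKey. The sketch below only covers the local part under that assumption: splitting the file into 32-byte hashes and grouping them into batches. `HashUpload`, `ReadHashes`, `ToBatches`, and `prefixLength` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class HashUpload
{
    private const int HashSize = 32;   // SHA-256 = 256 bits
    private const int MaxBatch = 100;  // Table Storage limit per batch

    // Reads the stream as consecutive 32-byte hashes and hex-encodes each one.
    public static IEnumerable<string> ReadHashes(Stream input)
    {
        var buffer = new byte[HashSize];
        while (true)
        {
            int read = 0;
            while (read < HashSize)
            {
                int n = input.Read(buffer, read, HashSize - read);
                if (n == 0) break;          // end of stream
                read += n;
            }
            if (read == 0) yield break;     // clean end on a hash boundary
            if (read < HashSize)
                throw new InvalidDataException("Trailing partial hash.");
            yield return BitConverter.ToString(buffer).Replace("-", "");
        }
    }

    // Groups keys by partition prefix and splits each group into batches of
    // at most 100. GroupBy buffers its whole input, so feed it a bounded
    // slice of the file rather than all 6*10^7 keys at once.
    public static IEnumerable<(string Partition, List<string> RowKeys)> ToBatches(
        IEnumerable<string> keys, int prefixLength)
    {
        foreach (var group in keys.GroupBy(k => k.Substring(0, prefixLength)))
        {
            var batch = new List<string>(MaxBatch);
            foreach (var key in group)
            {
                batch.Add(key);
                if (batch.Count == MaxBatch)
                {
                    yield return (group.Key, batch);
                    batch = new List<string>(MaxBatch);
                }
            }
            if (batch.Count > 0) yield return (group.Key, batch);
        }
    }
}
```

Each returned batch would then map to one batch request against the table. The trade-off is that lookups become partition prefix plus row key instead of partition key alone, which is still a point query.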
Upvotes: 3