hdev

Reputation: 6527

How to use Azure Table Storage for huge lookups

I have 2 GB of hashes in storage, which I want to check against from a public API.

Use Case

Let's say I want to create an API which checks whether a person is known to my product. To respect the person's privacy, I don't want to upload their name, member ID and so on, so I decided to upload only a hash of the combined information that identifies them. Now I have 2 GB (6×10^7) of SHA256 hashes and want to check against them insanely fast.

This API should be hosted in Azure.

After reading the documentation for Azure Storage accounts, I think Azure Table Storage is the right storage solution. I would set the Base64 hash as the partition key and leave the row key empty.
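
Roughly what I have in mind, as a sketch (the entity class, the BuildKey helper and the concatenation scheme are placeholders):

    using System;
    using System.Security.Cryptography;
    using System.Text;
    using Microsoft.WindowsAzure.Storage.Table;

    public class MemberHashEntity : TableEntity
    {
        public MemberHashEntity() { } // parameterless ctor required by the SDK

        public MemberHashEntity(string base64Hash)
        {
            // '/' is not allowed in partition/row keys, so the Base64 output
            // may need a URL-safe tweak (e.g. replacing '/' with '_').
            PartitionKey = base64Hash;
            RowKey = "";              // empty row key, as proposed above
        }

        // Build the key from the combined information (placeholder scheme).
        public static string BuildKey(string name, string memberId)
        {
            using (var sha = SHA256.Create())
            {
                byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(name + "|" + memberId));
                return Convert.ToBase64String(hash);
            }
        }
    }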

Question

  1. First, is Azure Table Storage the right storage for the job?
  2. Will there be a performance difference between:
    1. partition key: Base64 hash, row key: empty
    2. partition key: 'Upload Id', row key: Base64 hash
  3. Does the access time through the keys depend on the size of the table?
  4. What is the fastest way to check whether a partition key is present? I think my naive first try (below) is not really the best way.

    if (members.Where(x => x.PartitionKey == Convert.ToBase64String(data.Hash)).AsEnumerable().Any())
    {
        return req.CreateResponse(HttpStatusCode.OK, "Found hash");
    }
    else
    {
        return req.CreateResponse(HttpStatusCode.NotFound, "Hash not found");
    }

  5. How do I upload the 2 GB of hashes? I'm thinking about uploading one big file and using an Azure Function to split it after every 256 bits and add each value to Table Storage. Or is there a better idea?

Upvotes: 1

Views: 1485

Answers (2)

Mike S

Reputation: 3169

I saw this is tagged with azure-functions, so I'll add that Azure Functions lets you bind directly to Table storage. See https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-table

You can even bind directly to a specific entity. The function.json would look like:

{
    "name": "<Name of input parameter in function signature>",
    "type": "table",
    "direction": "in",
    "tableName": "<Name of Storage table>",
    "partitionKey": "<PartitionKey of table entity to read - see below>",
    "rowKey": "<RowKey of table entity to read - see below>",
}
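
A rough C# sketch of a function using such a binding (the table name "hashes", the route, and the empty row key are assumptions, not anything from your setup); the bound parameter comes back null when no entity matches:

    using System.Net;
    using System.Net.Http;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.Http;
    using Microsoft.WindowsAzure.Storage.Table;

    public class HashEntity : TableEntity { }

    public static class CheckHash
    {
        [FunctionName("CheckHash")]
        public static HttpResponseMessage Run(
            [HttpTrigger(AuthorizationLevel.Function, "get", Route = "check/{hash}")]
            HttpRequestMessage req,
            [Table("hashes", "{hash}", "")] HashEntity entity) // PK from the route, empty RK
        {
            // The table binding hands back null when no entity with that key exists.
            return entity != null
                ? req.CreateResponse(HttpStatusCode.OK, "Found hash")
                : req.CreateResponse(HttpStatusCode.NotFound, "Hash not found");
        }
    }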

Upvotes: 0

Mikhail Shilkov

Reputation: 35134

My take on this:

  1. If the only query you need is "check whether a given hash exists" (retrieving its details if needed), then Table Storage is the perfect match. Key lookups are fast and cheap, and 2 GB is nothing.

  2. The hash gives the most diversity, so I would use it as the partition key. The row key can then be anything. If the Upload Id is never used for (range) lookups, don't use it in the keys.

  3. With a proper partition key, the lookup time should be constant.

  4. If you mean you need to check whether a user's hash is there or not, just retrieve one row by partition key + row key. That's the fastest operation possible; see "Retrieve a single entity" in the Table Storage documentation, and the first sketch after this list.

  5. Table Storage supports batch inserts (second sketch below). Again, 2 GB is not much; you probably spent more time asking this question than your upload will take :)
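
For point 4, a minimal sketch with the classic WindowsAzure.Storage SDK (the table name "hashes" and the empty row key are assumptions taken from the question):

    using System;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    static async Task<bool> HashExistsAsync(string connectionString, byte[] hash)
    {
        CloudTable table = CloudStorageAccount.Parse(connectionString)
            .CreateCloudTableClient()
            .GetTableReference("hashes"); // assumed table name

        // Single-entity retrieve by PartitionKey + RowKey: the fastest and
        // cheapest query Table Storage offers.
        TableResult result = await table.ExecuteAsync(
            TableOperation.Retrieve(Convert.ToBase64String(hash), ""));

        return result.Result != null; // null means no entity with that key
    }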
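
For point 5, a batch-insert sketch. Keep two constraints in mind: a batch holds at most 100 operations, and all entities in one batch must share the same PartitionKey, so with the hash as partition key each batch degenerates to a single insert (it still works, just without the batching benefit):

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.Storage.Table;

    static async Task UploadAsync(CloudTable table, IEnumerable<ITableEntity> entities)
    {
        // Group by PartitionKey (a batch may only span one partition),
        // then chunk each partition into batches of at most 100 entities.
        foreach (var partition in entities.GroupBy(e => e.PartitionKey))
        {
            foreach (var chunk in partition
                .Select((e, i) => new { e, i })
                .GroupBy(x => x.i / 100, x => x.e))
            {
                var batch = new TableBatchOperation();
                foreach (var entity in chunk)
                    batch.InsertOrReplace(entity);

                await table.ExecuteBatchAsync(batch);
            }
        }
    }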

Upvotes: 3
