Xaptor

Reputation: 25

MS SQL to DynamoDB migration, what's the best partition key to choose in my case

I am working on a migration from MS SQL to DynamoDB and I'm not sure what the best hash key is for my purpose. In MS SQL I have an item table where I store some product information for different customers, so the primary key is currently a composite of two columns, customer_id and item_no. In the application code I need to query specific items as well as all items for a customer id, so my first idea was to set up the customer id as the hash key and the item no as the range key. But is this the best concept in terms of partitioning? I need to import product data daily, with 50,000-100,000 products for some larger customers, and as far as I know it would be better to have a random hash key; otherwise the import job will run against one partition only. Can somebody give me a hint what the best data model is in this case?

Bye, Peter

Upvotes: 0

Views: 145

Answers (2)

b-s-d

Reputation: 5153

You are on the right track, but you should be careful about how you handle write operations, since you are executing an import job on a daily basis. Also avoid adding indexes unnecessarily, as they will only multiply your write operations.

Using customer_id as the hash key and item_no as the range key is a good choice, not only for querying but also for uploading your data.

As you mentioned, randomizing your customer ids would be very helpful to optimize the use of resources and prevent a hot partition. In your case, I would follow the example given in the DynamoDB documentation:

[...] One way to increase the write throughput of this application would be to randomize the writes across multiple partition key values. Choose a random number from a fixed set (for example, 1 to 200) and concatenate it as a suffix [...]

So when you are writing your customer information, just randomly assign the suffix to your customer ids and make sure you distribute them evenly (e.g. CustomerXYZ.1, CustomerXYZ.2, ..., CustomerXYZ.200).
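A minimal sketch of that write path in Python with boto3; the table name ("items"), the attribute names (customer_id, item_no) and the suffix count are assumptions taken from your question and the documentation example:

```python
import random
import boto3

NUM_SUFFIXES = 200  # fixed suffix range, as in the quoted documentation example

table = boto3.resource("dynamodb").Table("items")  # hypothetical table name

def put_item(customer_id, item_no, attributes):
    # Pick a random suffix from the fixed set and append it to the customer id,
    # so writes for one customer are spread across NUM_SUFFIXES partition key values.
    suffix = random.randint(1, NUM_SUFFIXES)
    item = {
        "customer_id": f"{customer_id}.{suffix}",
        "item_no": item_no,
        **attributes,
    }
    table.put_item(Item=item)
```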

To read all of the items you would need to query each suffix. For example, you would first issue a Query request for the partition key value CustomerXYZ.1, then another Query for CustomerXYZ.2, and so on through CustomerXYZ.200. Because you know the suffix range (in this case 1...200), you only need to query the records formed by appending each suffix to the customer id.

Each query by the hash key CustomerXYZ.n should return a set of items (specified by the range key) from that specific customer; your application would need to merge the results from all of the Query requests.
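A corresponding read sketch under the same assumptions (table named "items", 200 suffixes), issuing one Query per suffix and merging the results:

```python
import boto3
from boto3.dynamodb.conditions import Key

NUM_SUFFIXES = 200
table = boto3.resource("dynamodb").Table("items")  # hypothetical table name

def get_all_items_for_customer(customer_id):
    items = []
    for suffix in range(1, NUM_SUFFIXES + 1):
        key_condition = Key("customer_id").eq(f"{customer_id}.{suffix}")
        response = table.query(KeyConditionExpression=key_condition)
        items.extend(response["Items"])
        # Follow pagination in case a single suffix holds more than 1 MB of data.
        while "LastEvaluatedKey" in response:
            response = table.query(
                KeyConditionExpression=key_condition,
                ExclusiveStartKey=response["LastEvaluatedKey"],
            )
            items.extend(response["Items"])
    return items
```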

This will certainly make reading the records harder (in terms of the additional requests needed); however, the benefits of optimized throughput and performance will pay off. Remember that a hot partition will not only increase your overall financial cost, it will also drastically impact your performance.

If you have a well designed partition key, your queries will always return quickly and at minimum cost.

Additionally, make sure your import job does not group write operations by customer: instead of writing all items from a specific customer in series, sort the write operations so they are distributed across all customers (see the sketch after the quoted documentation below). Even though your customers will be spread over several partitions (due to the id randomization), you are better off taking this additional safety measure to prevent a burst of write activity in a single partition. More details below:

From the 'Distribute Write Activity During Data Upload' section of the official DynamoDB documentation:

To fully utilize all of the throughput capacity that has been provisioned for your tables, you need to distribute your workload across your partition key values. In this case, by directing an uneven amount of upload work toward items all with the same partition key value, you may not be able to fully utilize all of the resources DynamoDB has provisioned for your table. You can distribute your upload work by uploading one item from each partition key value first. Then you repeat the pattern for the next set of sort key values for all the items until you upload all the data [...]

Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
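A rough sketch of that interleaving for the daily import; `items_by_customer` is a hypothetical dict of the form {customer_id: [item, ...]} built from your import file, and the table name is again an assumption:

```python
import boto3
from itertools import zip_longest

table = boto3.resource("dynamodb").Table("items")  # hypothetical table name

def interleaved(items_by_customer):
    # Each round of zip_longest yields at most one item per customer,
    # so consecutive writes rarely target the same customer (and partition).
    for round_of_items in zip_longest(*items_by_customer.values()):
        for item in round_of_items:
            if item is not None:
                yield item

def run_import(items_by_customer):
    # batch_writer buffers the PutItem calls and sends them in batches.
    with table.batch_writer() as writer:
        for item in interleaved(items_by_customer):
            writer.put_item(Item=item)
```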

I hope that helps. Regards.

Upvotes: 0

Mark B

Reputation: 201058

It sounds like you need item_no as the partition key, with customer_id as the sort key. Also, in order to query all items for a customer_id efficiently you will want to create a Global Secondary Index on customer_id.

This configuration should give you a good distribution while allowing you to run the queries you have specified.
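A sketch of that table layout with boto3; the table name, index name, attribute types, and throughput values are placeholders, not values from the question:

```python
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="items",  # hypothetical table name
    AttributeDefinitions=[
        {"AttributeName": "item_no", "AttributeType": "S"},
        {"AttributeName": "customer_id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "item_no", "KeyType": "HASH"},       # partition key
        {"AttributeName": "customer_id", "KeyType": "RANGE"},  # sort key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "customer_id-index",
            "KeySchema": [{"AttributeName": "customer_id", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```

Queries for all items of a customer would then go against the customer_id-index GSI instead of the base table.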

Upvotes: 0
