DynamoDB GSI partition contains large set

Question

I have a DynamoDB table with a high number of writes and fewer reads. The partitions are small (approximately 100 items per Partition Key value). The items written into this table are part of a set of 1k to 100k items. This works well.

I have a requirement to be able to perform queries on the whole batch using a different Sort Key. To support this, I had to create a Global Secondary Index with Batch ID as the Partition Key and the appropriate Sort Key. It works, but it means a single partition contains the entire set (potentially 100k items). Even if it does not hit the 10 GB limit, this feels suboptimal.

Am I overthinking this, and will DynamoDB handle a 100k-item partition just fine?

Are there any recommended patterns for the situation where a GSI would require the whole set to query?

F_SO_K · Accepted Answer

First, if your table does not have an LSI, there is no limit on how large a single item set (all items sharing one partition key) can be. If the table has an LSI, the limit for one item set is 10 GB. This is not to be confused with physical partitions, which have a maximum size of 10 GB.

To answer your question we really need more information on the access pattern for the GSI.

There is nothing wrong with using a GSI with a single partition to order your data, and then using a Scan on that GSI to get all of that data, or perhaps the first N items. That said, if you are scanning the GSI, you might just want to scan the base table; it would probably be cheaper than creating the GSI. Note that Scans can actually be fairly fast: make sure you use parallel scans and set the number of threads to roughly the number of MBs of data in the table. They are expensive, though, as they consume RCUs for every item in the table.
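To make the parallel Scan advice concrete, here is a minimal sketch in Python. It assumes `client` is a boto3 DynamoDB client (for example `boto3.client("dynamodb")`) and relies on the Scan parameters `Segment`, `TotalSegments`, `ExclusiveStartKey` and `IndexName`. The function names and the one-thread-per-segment layout are illustrative choices, not something prescribed above.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(client, table_name, segment, total_segments, index_name=None):
    """Read every page of one segment of a parallel Scan."""
    kwargs = {
        "TableName": table_name,
        "Segment": segment,
        "TotalSegments": total_segments,
    }
    if index_name:
        # Scan the GSI instead of the base table.
        kwargs["IndexName"] = index_name
    items = []
    while True:
        page = client.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Resume this segment where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def parallel_scan(client, table_name, total_segments, index_name=None):
    """Scan all segments concurrently, one worker thread per segment."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(scan_segment, client, table_name, s, total_segments, index_name)
            for s in range(total_segments)
        ]
        # Collect results in segment order.
        return [item for f in futures for item in f.result()]
```

Following the rule of thumb above, a table holding about 20 MB would be scanned with `parallel_scan(client, "my-table", 20)`, where `"my-table"` is a placeholder name. Every segment still consumes read capacity for everything it reads, so this makes the Scan faster, not cheaper.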

However, if you are planning to Query the data and say "Give me the data between X date and Y date", your approach is probably not good. That Query is likely to be quite slow, because a Query against a single partition key has no equivalent of a parallel Scan: the results are read sequentially, one page at a time.

Instead, you might want to consider a time-series pattern. Basically, you create a field with a date block (let's say a single day like 2020-10-13) and make that the partition key. Now you can get the data you need using a series of Queries, one for each day in your date range.
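The series of per-day Queries could look roughly like the sketch below. It assumes `client` is a boto3 DynamoDB client and that the index's partition key is a string attribute holding the day; the attribute name `day_bucket` and both function names are hypothetical.

```python
from datetime import date, timedelta


def day_buckets(start, end):
    """Return every day from start to end (inclusive) as 'YYYY-MM-DD'."""
    days = (end - start).days
    return [(start + timedelta(days=n)).isoformat() for n in range(days + 1)]


def query_date_range(client, table_name, index_name, start, end):
    """Run one Query per day bucket and concatenate the results."""
    items = []
    for bucket in day_buckets(start, end):
        kwargs = {
            "TableName": table_name,
            "IndexName": index_name,
            # "day_bucket" is a hypothetical partition key attribute name.
            "KeyConditionExpression": "day_bucket = :d",
            "ExpressionAttributeValues": {":d": {"S": bucket}},
        }
        while True:
            page = client.query(**kwargs)
            items.extend(page.get("Items", []))
            last_key = page.get("LastEvaluatedKey")
            if not last_key:
                break
            # Fetch the next page of the same day bucket.
            kwargs["ExclusiveStartKey"] = last_key
    return items


print(day_buckets(date(2020, 10, 13), date(2020, 10, 15)))
# → ['2020-10-13', '2020-10-14', '2020-10-15']
```

Because each day is its own partition key value, the per-day Queries are independent and could also be issued concurrently. The loop here runs them one after another only to keep the sketch short.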

The time-series principle of making blocks of data out of your continuous range key can be applied to any continuous data, not just time.
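As an illustration of the same idea for non-time data, the sketch below buckets an integer range key into fixed-width blocks. The block width, the function names, and the `"<batch id>#<block start>"` key format are all assumptions made for the example; combining the block with the Batch ID is one possible way to fit this to the question's GSI, not something stated above.

```python
def bucket_of(value, width):
    """Map an integer value to the start of its fixed-width block."""
    return (value // width) * width


def buckets_for_range(low, high, width):
    """List the block starts that cover the closed range [low, high]."""
    return list(range(bucket_of(low, width), bucket_of(high, width) + width, width))


def partition_key(batch_id, value, width):
    """Build a hypothetical composite GSI partition key for an item."""
    return f"{batch_id}#{bucket_of(value, width)}"


print(buckets_for_range(250, 1020, 500))    # → [0, 500, 1000]
print(partition_key("batch-42", 1020, 500))  # → batch-42#1000
```

A range Query then becomes one Query per block returned by `buckets_for_range`, just like the one-Query-per-day approach for dates.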

In short, if you are planning to Query a GSI with a single partition key, think again. If you are planning to Scan a GSI with a single partition key, that's probably fine.
