Octopus

Reputation: 8325

How do I query DynamoDB when I want to consider the sort key but not the partition key?

I can't figure out how to do this in DynamoDB.

I have a table with data something like this:

ID    Updated     other fields...
1200  2017-12-11  ...
1201  2018-02-05  ...
1205  2018-01-05  ...
1206  2018-01-11  ...
1210  2018-02-15  ...
1212  2018-02-10  ...

The partition key is 'ID' and I have a sort key of 'Updated'.

I want to retrieve the records where Updated is greater than "2018-02-01", say.

I can't query on 'Updated' alone; it fails with "Query condition missed key schema element: ID". I understand what that means, but I'm not sure how to do this properly.

I've tried adding various indexes and then querying on the index, including one with only the 'Updated' field as the partition key, but then I can't query for a range of values, only for an exact match on the partition key.

So, how do I query across multiple partitions for a condition?

I could use a scan, but that is potentially expensive. Can I do this by indexing it a certain way? Or is there a way to do something similar to a query where I don't need to specify the partition key?

Upvotes: 7

Views: 3873

Answers (3)

F_SO_K

Reputation: 14799

Use a scan

Almost everyone using DynamoDB seems to get worried about scans. Scans are FINE in many circumstances. Things you should ask yourself include: how much data will I have, how will it grow over time, how fast do I need the scan to complete, and how many RCUs will this cost? Don't just dismiss scans - do the maths.
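To make "do the maths" concrete, here is a rough sketch of the cost estimate. It assumes the usual billing rule that a scan is charged for the data it reads (not what a filter returns), at 1 RCU per 4 KB for strongly consistent reads and half that for eventually consistent reads. The function names are made up for illustration, and the result is only an approximation because real rounding happens per request.

```python
import math


def estimate_scan_rcus(table_size_bytes: int, strongly_consistent: bool = False) -> float:
    """Approximate read capacity units consumed by one full table scan."""
    # Billing is based on bytes read, in 4 KB units.
    units = math.ceil(table_size_bytes / 4096)
    # Eventually consistent reads (the default for Scan) cost half as much.
    return units if strongly_consistent else units / 2


def scan_duration_seconds(rcus: float, provisioned_rcus_per_second: float) -> float:
    """Lower bound on scan time if the scan may use all provisioned capacity."""
    return rcus / provisioned_rcus_per_second


# Hypothetical 1 GiB table with 100 RCUs/second provisioned.
rcus = estimate_scan_rcus(2**30)
seconds = scan_duration_seconds(rcus, 100)
```

If that number of RCUs and that duration are acceptable for how often you need the data, a scan with a filter on Updated may be all you need.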

Archive data

If you only need to access recent data, consider deleting or archiving old data. By removing it from your table you can increase the performance of scans.

Partition by date

There are various strategies you can use to improve your table performance if you really want to use a query. For example, you could have a partition key of YYYY-MM and a sort key of datetime (down to the nanosecond). That way you can retrieve whole months of data in one query, while still being able to use the sort key to narrow down to specific date ranges. This kind of query is much more complicated to handle in your application than a scan. Architecting your tables really depends on your data access patterns.
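As a sketch of what the application side might look like with a YYYY-MM partition key: first work out which month partitions a date range touches, then build one Query request per month. The attribute name YearMonth, the table name, and the helper names are assumptions for illustration; the dictionary uses the parameter names of the low-level DynamoDB Query API.

```python
from datetime import date


def month_partitions(start: date, end: date) -> list:
    """List the YYYY-MM partition keys covering start..end inclusive."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys


def build_month_query(table_name: str, month_key: str, lo_iso: str, hi_iso: str) -> dict:
    """Build Query parameters for one month partition (YearMonth is hypothetical)."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "#pk = :pk AND #sk BETWEEN :lo AND :hi",
        "ExpressionAttributeNames": {"#pk": "YearMonth", "#sk": "Updated"},
        "ExpressionAttributeValues": {
            ":pk": {"S": month_key},
            ":lo": {"S": lo_iso},
            ":hi": {"S": hi_iso},
        },
    }


# One request per month between the two dates.
requests = [
    build_month_query("MyTable", key, "2017-12-11", "2018-02-15")
    for key in month_partitions(date(2017, 12, 11), date(2018, 2, 15))
]
```

Each dictionary could then be passed to a client as `client.query(**request)`, with the results concatenated in month order.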

Upvotes: 4

gerrytan

Reputation: 41123

I would consider alternative partition keys. For example, will your business logic work if you create a GSI with year as partition key and date as sort key? How about year-month?

Your query will be more complex to write, as you might have to issue multiple queries covering more than one partition to fill your result page.

But as you pointed out, this is cheaper than performing a full table scan.
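A minimal sketch of filling one result page from several partitions. To keep it independent of any SDK, the actual DynamoDB call is passed in as `query_fn`, a hypothetical callable that takes a partition key and an optional start key and returns the items plus the next start key (or None when the partition is exhausted). A real implementation would also hand back a cursor so the next page can resume where this one stopped; that is left out here.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical signature: (partition_key, start_key) -> (items, next_start_key)
QueryFn = Callable[[str, Optional[dict]], Tuple[list, Optional[dict]]]


def fill_page(query_fn: QueryFn, partition_keys: Iterable[str], page_size: int) -> List:
    """Query partitions in order until page_size items have been collected."""
    page = []
    for pk in partition_keys:
        start_key = None
        while True:
            items, start_key = query_fn(pk, start_key)
            for item in items:
                page.append(item)
                if len(page) == page_size:
                    return page
            if start_key is None:
                # This partition has no more pages; move on to the next one.
                break
    return page
```

With year as the partition key, `partition_keys` would be something like `["2018", "2017"]`, newest first.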

Upvotes: 1

Costin

Reputation: 3029

Nice problem, not so nice solution! :)

• You cannot do a query without an equality condition on the partition key.
• You need the Updated column to be a sort key, either in the table schema or in an index. If it is no longer a sort key, you won't be able to efficiently query for Updated > VALUE.

So you need a constant partition key and Updated to be the sorting key. Here is your Global Secondary Index:
• PK: ConstantColumn
• SK: Updated

Of course, you'll lose some scalability because the whole index will be in one partition, but using a KEYS_ONLY projection should give you enough room.
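For illustration, the request against such an index could look roughly like this. The index name, the table name, and the constant value "C" are assumptions; the dictionary keys are the parameter names of the low-level DynamoDB Query API.

```python
def build_updated_since_query(table_name: str, index_name: str, since_iso: str) -> dict:
    """Build Query parameters for 'Updated > since' on a constant-PK index."""
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # Equality on the constant partition key, range on the sort key.
        "KeyConditionExpression": "#pk = :pk AND #updated > :since",
        "ExpressionAttributeNames": {"#pk": "ConstantColumn", "#updated": "Updated"},
        "ExpressionAttributeValues": {
            ":pk": {"S": "C"},  # assumed constant value stored on every item
            ":since": {"S": since_iso},
        },
    }


params = build_updated_since_query("MyTable", "UpdatedIndex", "2018-02-01")
```

Passing `params` to a client, for example `boto3.client("dynamodb").query(**params)`, should return the matching keys in ascending Updated order. With KEYS_ONLY you get only the key attributes back and would need to fetch the full items from the table separately.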

Should you really need more scalability consider having PK values like C0, C1, ..., Cn, iterate through queries for each partition key, then merge the results (divide et impera).
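A small sketch of the C0..Cn idea, under the assumption that each item is assigned a shard by hashing its ID and that each per-shard query returns items already sorted by Updated (which is how a Query orders results by sort key). The function names are hypothetical and the query calls themselves are not shown.

```python
import heapq
import zlib


def shard_key(item_id: str, shard_count: int) -> str:
    """Pick a stable partition key value C0..C(n-1) for an item."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    return f"C{zlib.crc32(item_id.encode()) % shard_count}"


def merge_shard_results(results_per_shard: list, sort_attr: str = "Updated") -> list:
    """Merge per-shard result lists, each already sorted, into one sorted list."""
    return list(heapq.merge(*results_per_shard, key=lambda item: item[sort_attr]))


# Results as they might come back from two shards for Updated > "2018-02-01".
shard_a = [{"ID": "1201", "Updated": "2018-02-05"}, {"ID": "1210", "Updated": "2018-02-15"}]
shard_b = [{"ID": "1212", "Updated": "2018-02-10"}]
merged = merge_shard_results([shard_a, shard_b])
```

The merge is linear in the number of items, so the extra cost is mostly the n queries rather than the combining step.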

Upvotes: 2
