raviiii1
raviiii1

Reputation: 976

DynamoDB query performance with - a unique partition key vs a unique partition+sort key

Lets say I have a Dynamo DB table named 'student_course'. I want to store the course that each student is taking in a University. One student can take multiple courses at a time and one course can have many student at a time. So basically it is a multi-mapping.

My data access pattern has only one use case -

  1. get a record of one student and one course at a time, i.e. is get data for each StudentId and CourseId combination. Its is guaranteed that for a student-id and course-id combination, there is only one record available.

To achieve this I can store the data in these 2 ways -

  1. Partition-key = {student-id}, sort-key = {course-id}
  2. Partition-key = "studentId:{student-id}_courseId:{course-id}", sort-key does not exist

My Question is - Which query would perform better, if there is any difference, that is? Which one should I choose over the other and why?

Upvotes: 5

Views: 2658

Answers (1)

HJW
HJW

Reputation: 23443

In terms of designing for DDB Performance, the Get API is key to milli-seconds data retrieval capability of DDB, therefore it is logical to design your data around this API

Table with Partition Key + Sort Key

Partition Key | Sort Key    
--------------+-------------
Course1       | Student1
Course1       | Student2

Advantage:

  1. Able to use the Get API to get a Single record by the Partition Key and the Sort Key e.g. Get single record where Partition Key = "Course1" and Sort Key = "Student1"
  2. Able to use the Get API to get a List of records by only the Partition Key e.g. Get all records where Partition Key = "Course1"

Disadvantage:

  1. If you know only the Sort Key (i.e. Student) and not the Partition Key (i.e. Course), you won't be able to use the Get API to retrieve the record only by the Sort Key

Note: In general, the efficiency (such that queries do not hit the ReadThroughput exception "Roof" easily) of DDB Get API queries is quite coupled with the uniqueness and distribution of the Partition Key. The more the partition keys you have and spreaded out, the better the performance

Table with Partition Key only

Partition Key Only   
--------------------
Course1#Student1
Course1#Student2

Advantage:

  1. Able to use the Get API to get a Single record by the Partition Key e.g. Get single record where Partition Key = "Course1#Student1"

Disadvantage:

  1. Will not be able to use the Get API to get a List of records using only a subset of the Partition key e.g. Get a List of records where Partition Key = "Course1"

About GSIs

Note: It is a common scenario to add Global Secondary Index on Tables to support Get API calls using an alternate key, like Get a List of records where GSI Partition Key = Course Name

Partition Key Only   | Non Key Attribute (Course) For GSI
---------------------+---------------------------
Course1#Student1     | Course1
Course1#Student2     | Course1

You can have up to 20 GSI Indexes (soft limit), with the option to remove this limit through a Support request

Partition Key Only   | Non Key Attribute (Course) For GSI | Lecturer (For GSI 2)
---------------------+------------------------------------+---------------------
Course1#Student1     | Course1                            | Lecturer1
Course1#Student2     | Course1                            | Lecturer1

Conclusion

I would design a Table to have as many Unique values for the Partition Keys as possible if the Performance is key i.e. Partition Key = Course1#Student1 VS Partition Key = Course1, Sort Key = Student1

Add GSIs to Tables on-demand if you need to query by an alternate key

(Historically GSIs were limited to 5 per Table, and had to be specified during Table creation, but these limitations are since lifted)

Upvotes: 5

Related Questions