franco phong

Reputation: 2219

How to query big data in DynamoDB: best practices

I have a scenario: query the list of students in a school by year, then use that information to do some other tasks, say printing a certificate for each student.

I'm using the serverless framework to deal with that scenario with this Lambda:

// Assumes the AWS SDK v2 DocumentClient
const AWS = require('aws-sdk');
const dynamoClient = new AWS.DynamoDB.DocumentClient();

const queryStudent = async (_school_id, _year) => {
  const params = {
    TableName: 'schoolTable',
    // Key condition values must be passed as expression attribute values
    KeyConditionExpression:
      'partition_key = :school_id AND begins_with(sort_key, :year)',
    ExpressionAttributeValues: {
      ':school_id': _school_id,
      ':year': _year,
    },
  };

  try {
    let _students = [];
    let items;
    do {
      items = await dynamoClient.query(params).promise();
      // Accumulate each page instead of overwriting the previous one
      _students = _students.concat(items.Items);

      params.ExclusiveStartKey = items.LastEvaluatedKey;
    } while (typeof items.LastEvaluatedKey !== 'undefined');

    return _students;
  } catch (e) {
    console.log('Error: ', e);
  }
};


const mainHandler = async (event, context) => {
    …
    let students = await queryStudent(body.school_id, body.year);
    await printCertificate(students);
    …
}

So far, it’s working well with about 5k students (just sample data).

My concern: is this a scalable solution for querying large amounts of data in DynamoDB?

As far as I know, Lambda has a limited execution time. If the number of students grows to a million, will the above solution still work?

Any best-practice approach for this scenario would be much appreciated.

Upvotes: 0

Views: 507

Answers (1)

Maurice

Reputation: 13108

If you think about scaling, there are multiple potential bottlenecks here, which you could address:

  • Hot Partition: right now you store all students of a single school in a single item collection. That means they will be stored on a single storage node under the hood. If you run many queries against this, you might run into throughput limitations. You can use techniques like read/write sharding here, e.g. add a suffix to the partition key and do scatter-gather reads across the shards.
  • Lambda: Query: If you want to query a million records, this is going to take time. Lambda might not be able to do that (and the processing) within its 15-minute limit, and if it fails before it's completely through, you lose track of how far you've come. You could do checkpointing for this, i.e. save the LastEvaluatedKey somewhere else, check whether it exists on new Lambda invocations, and resume from there.
  • Lambda: Processing: You seem to be creating a certificate for each student in a year in the same Lambda function that does the querying. This won't scale if it's a synchronous process and you have a million students. If things fail, you also have to consider retries and build that logic into your code.
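A minimal sketch of the write-sharding idea from the first bullet. The shard count, the key format (`schoolId#shard`), the hash function, and the `dynamoClient` variable are all assumptions for illustration, not part of the original schema:

```javascript
// Sketch only: shard count and key format are assumptions.
const SHARD_COUNT = 10;

// A simple deterministic string hash used for shard assignment.
const hashCode = (s) => {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) | 0;
  }
  return h;
};

// On write, append a shard suffix so one school's students spread
// across several item collections, e.g. "school#123#7".
const shardedPartitionKey = (schoolId, studentId) =>
  `${schoolId}#${Math.abs(hashCode(studentId)) % SHARD_COUNT}`;

// On read, scatter-gather: query every shard in parallel and merge.
// Assumes a DocumentClient instance named `dynamoClient`.
const queryAllShards = async (schoolId, year) => {
  const pages = await Promise.all(
    Array.from({ length: SHARD_COUNT }, (_, shard) =>
      dynamoClient
        .query({
          TableName: 'schoolTable',
          KeyConditionExpression:
            'partition_key = :pk AND begins_with(sort_key, :year)',
          ExpressionAttributeValues: {
            ':pk': `${schoolId}#${shard}`,
            ':year': year,
          },
        })
        .promise()
    )
  );
  return pages.flatMap((p) => p.Items);
};
```

The trade-off: writes and reads stay spread over `SHARD_COUNT` item collections, but every read now costs `SHARD_COUNT` queries, so pick the shard count based on your actual throughput needs.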

If you want this to scale to a million students per school, I'd probably change the architecture to something like this:

You have a Step Function that you invoke when you want to print the certificates. This Step Function has a single Lambda function. The Lambda function queries the table across the sharded partition keys and writes each student into an SQS queue of certificate-printing tasks. If the Lambda function notices it's close to the runtime limit, it returns the LastEvaluatedKey, and the Step Function recognizes that and starts the function again with this offset. The SQS queue can invoke Lambda functions to actually create the certificates, possibly in batches.
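A sketch of what that query-and-enqueue Lambda could look like. The client variables (`dynamoClient`, `sqsClient`), the queue URL environment variable, and the event shape the Step Function passes in are assumptions:

```javascript
// Sketch of the query-and-enqueue Lambda driven by a Step Function.
// Assumes `dynamoClient` (DocumentClient) and `sqsClient` (SQS) exist,
// and that the Step Function passes school_id, year, and an optional
// lastEvaluatedKey checkpoint in the event.
const handler = async (event, context) => {
  const params = {
    TableName: 'schoolTable',
    KeyConditionExpression:
      'partition_key = :pk AND begins_with(sort_key, :year)',
    ExpressionAttributeValues: { ':pk': event.school_id, ':year': event.year },
    // Resume from the checkpoint the Step Function passed back in.
    ExclusiveStartKey: event.lastEvaluatedKey,
  };

  let page;
  do {
    page = await dynamoClient.query(params).promise();

    // Enqueue one certificate-printing task per student.
    await Promise.all(
      page.Items.map((student) =>
        sqsClient
          .sendMessage({
            QueueUrl: process.env.CERT_QUEUE_URL,
            MessageBody: JSON.stringify(student),
          })
          .promise()
      )
    );
    params.ExclusiveStartKey = page.LastEvaluatedKey;

    // Stop early near the 15-minute limit and hand the checkpoint
    // back to the Step Function so it can restart the function.
    if (page.LastEvaluatedKey && context.getRemainingTimeInMillis() < 60000) {
      return { done: false, lastEvaluatedKey: page.LastEvaluatedKey };
    }
  } while (page.LastEvaluatedKey);

  return { done: true };
};
```

The Step Function state machine would then loop on this Lambda until `done` is `true`, feeding `lastEvaluatedKey` back into the input each iteration.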

This way you decouple query from processing and also have built-in retry logic for failed tasks in the form of the SQS/Lambda integration. You also include the checkpointing for the query across many items.
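The consumer side of that SQS/Lambda integration could be sketched like this; `printCertificate` is the question's own processing step, assumed here to handle a single student:

```javascript
// Sketch of the SQS-triggered consumer Lambda.
// Assumes `printCertificate(student)` exists and handles one student.
const certificateHandler = async (event) => {
  // SQS invokes Lambda with a batch of records (up to 10 by default).
  for (const record of event.Records) {
    const student = JSON.parse(record.body);
    await printCertificate(student);
    // A throw here makes the whole batch visible again for retry;
    // consider partial-batch responses or a dead-letter queue to
    // avoid re-processing students that already succeeded.
  }
};
```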

Implementing this requires more effort, so I'd first figure out whether a million students per school per year is a realistic number :-)

Upvotes: 1
