Vadorequest

Reputation: 18059

Serverless - DynamoDB (terrible) performance compared to RethinkDB + AWS Lambda

In the process of migrating an existing Node.js (Hapi.js) + RethinkDB application from an OVH VPS (smallest VPS) to AWS Lambda (Node) + DynamoDB, I've recently come across a huge performance issue.

The usage is rather simple: people use an online tool, and "stuff" gets saved in the DB, passing through a Node.js server/Lambda. That "stuff" takes some space, around 3 KB non-gzipped (a complex object with lots of keys and children, hence why a NoSQL solution makes sense).

There is no issue with the saving itself (for now...): not many people use the tool and there isn't much simultaneous writing, which is what makes a Lambda a better fit than a 24/7 running VPS.


The real issue is when I want to download those results.

So, the operation takes around 3 seconds with RethinkDB, and would theoretically take 45 seconds with DynamoDB, for the same amount of fetched data.

Let's look at that data now. There are about 2,200 items in the table, for a total of 5 MB. Here are the DynamoDB stats:

Provisioned read capacity units: 29 (Auto Scaling enabled)
Provisioned write capacity units: 25 (Auto Scaling enabled)
Last decrease time: October 24, 2018 at 2:34:34 AM UTC (4:34:34 AM UTC+2 local)
Last increase time: October 24, 2018 at 10:22:07 AM UTC (12:22:07 PM UTC+2 local)
Storage size: 5.05 MB
Item count: 2,195

The baseline is 5 provisioned read/write capacity units, with an autoscaling maximum of 300. But autoscaling doesn't behave as I'd expect: it went from 5 to 29, and could go up to 300, which would be enough to download 5 MB in 30 seconds, but it doesn't use them (I'm just getting started with autoscaling, so I guess it's misconfigured?).
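As a sanity check on those numbers (assuming the Scan uses strongly consistent reads, where 1 RCU covers one 4 KB read per second): 29 RCU × 4 KB ≈ 116 KB/s, and 5 MB ≈ 5,120 KB, so 5,120 / 116 ≈ 44 seconds, which matches the ~45 seconds mentioned above. With eventually consistent reads (0.5 RCU per 4 KB) it would be roughly half that, and only near the full 300 RCU would the whole scan fit in a few seconds.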

[CloudWatch screenshot of the table's read capacity metrics during the export]

Here we can see the effect of the autoscaling, which does increase the number of read capacity units, but it does so too late and the timeout has already happened. I've tried to download the data several times in a row and didn't really see much improvement, even with 29 units.

The Lambda itself is configured with 128 MB of RAM; increasing it to 1024 MB has no effect (as I'd expect, which confirms the issue comes from the DynamoDB scan duration).


So, all this makes me wonder why DynamoDB can't do in 30 seconds what RethinkDB does in 3 seconds. It's not related to any kind of indexing, since the operation is a "scan" and therefore has to go through all items in the table in any order.

I wonder how I'm supposed to fetch that HUGE dataset (5 MB!) from DynamoDB to generate a CSV.
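For reference, the export boils down to a paginated Scan (Scan returns at most 1 MB per call, hence the LastEvaluatedKey loop). Here is a minimal sketch of that kind of export with the Node.js AWS SDK (v2); the table name ResultsTable is a placeholder:

// Minimal sketch of a paginated Scan export (Node.js AWS SDK v2).
// "ResultsTable" is a placeholder name, not the real table.
const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient({ region: 'eu-west-1' }); // Ireland, as in the stats above

async function scanAllItems() {
  const items = [];
  let lastEvaluatedKey;
  do {
    const page = await documentClient.scan({
      TableName: 'ResultsTable',
      ExclusiveStartKey: lastEvaluatedKey, // undefined on the first call
    }).promise();
    items.push(...page.Items);
    lastEvaluatedKey = page.LastEvaluatedKey; // present while there are more pages
  } while (lastEvaluatedKey);
  return items; // ~2,200 items / ~5 MB in this case, then turned into CSV
}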

And I really wonder whether DynamoDB is the right tool for the job; I really wasn't expecting such low performance compared to what I've used in the past (Mongo, Rethink, Postgres, etc.).

I guess it all comes down to proper configuration (and there probably are many things to improve there), but even so, why is it such a pain to download a bunch of data? 5 MB is not a big deal, but it feels like it requires a lot of effort and attention, while exporting a single table is just a common operation (stats, dump for backup, etc.).


Edit: Since I created this question, I read https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b which explains in depth the issue I ran into. Basically, autoscaling is slow to trigger, which explains why it doesn't scale right for my use case. This article is a must-read if you want to understand how DynamoDB auto-scaling works.

Upvotes: 1

Views: 1130

Answers (2)

F_SO_K

Reputation: 14859

I have come across exactly the same problem in my application (i.e. DynamoDB autoscaling does not kick in fast enough for an on-demand high intensity job).

I was pretty committed to DynamoDB by the time I came across the problem, so I worked around it. Here is what I did.

When I'm about to start a high-intensity job, I programmatically increase the RCU and WCU on my DynamoDB table. In your case you could probably have one Lambda increase the throughput, then have that Lambda kick off another one to do the high-intensity job (see the sketch below). Note that increasing provision can take a few seconds, hence splitting this into a separate Lambda is probably a good idea.
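A minimal sketch of that first Lambda with the Node.js AWS SDK (v2); the table name, capacity values and the second function's name are placeholders:

// Bump provisioned throughput, then invoke the worker Lambda asynchronously.
// "Invoices", 100/100 and "extract-worker" are placeholder values.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();
const lambda = new AWS.Lambda();

exports.handler = async () => {
  await dynamodb.updateTable({
    TableName: 'Invoices',
    ProvisionedThroughput: {
      ReadCapacityUnits: 100,
      WriteCapacityUnits: 100,
    },
  }).promise();

  // Fire-and-forget invocation of the high-intensity job.
  await lambda.invoke({
    FunctionName: 'extract-worker',
    InvocationType: 'Event',
  }).promise();
};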

I will paste my personal notes on the problem I faced below. Apologies, but I can't be bothered to format them into Stack Overflow markup.


We want enough throughput provisioned all the time so that users have a fast experience, and even more importantly, don't get any failed operations. However, we only want to provision enough throughput to serve our needs, as it costs us money.

For the most part we can use Autoscaling on our tables, which should adapt our provisioned throughput to the amount actually being consumed (i.e. more users = more throughput automatically provisioned). This fails in two key aspects for us:

1. Autoscaling only increases throughput about 10 minutes after the throughput provision threshold is breached, and when it does start scaling up, it is not very aggressive about it. There is a great blog post on this here: https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b

2. When there is literally zero consumption of throughput, DynamoDB does not decrease throughput (see "AWS Dynamo not auto-scaling back down").

The place we really need to manage throughput is the Invoices table WCUs. RCUs are a lot cheaper than WCUs, so reads are less of a worry to provision. For most tables, provisioning a few RCUs and WCUs should be plenty. However, when we do an extract from the source, our write capacity on the Invoices table is high for a 30-minute period.

Let's imagine we just relied on Autoscaling. When a user kicked off an extract, we would have 5 minutes of burst capacity, which may or may not be enough throughput. Autoscaling would kick in after around 10 minutes (at best), but it would do so ponderously, not scaling up as fast as we needed. Our provision would not be high enough, we would get throttled, and we would fail to get the data we wanted. If several processes were running concurrently, this problem would be even worse; we just couldn't handle multiple extracts at the same time.

Fortunately we know when we are about to beast the Invoices table, so we can programmatically increase throughput on it. Increasing throughput programmatically seems to take effect very quickly, probably within seconds. I noticed in testing that the Metrics view in DynamoDB is pretty useless: it's really slow to update and I think sometimes it just showed the wrong information. You can use the AWS CLI to describe the table and see what throughput is provisioned in real time:

aws dynamodb describe-table --table-name DEV_Invoices

In theory we could just increase throughput when an extract started, and then reduce it again when we were finished. However, whilst you can increase throughput provision as often as you like, you can only decrease it 4 times in a day, after which you can decrease it once every hour (i.e. at most 27 times in 24 hours): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#default-limits-throughput. This approach is not going to work, as our decrease in provision might well fail.

Even if Autoscaling is in play, it still has to abide by the provisioning decrease rules. So if we've decreased 4 times, Autoscaling will have to wait an hour before decreasing again, and that's for both read and write values.

Increasing throughput provision programmatically is a good idea: we can do it fast (much faster than Autoscaling), so it works for our infrequent high workloads. We can't decrease throughput programmatically after an extract (see above), but there are a couple of other options.

Autoscaling for throughput decrease

Note that even when Autoscaling is set, we can programmatically change the provision to anything we like (e.g. higher than the maximum Autoscaling level).

We can just rely on Autoscaling to bring the capacity back down an hour or two after the extract has finished; that's not going to cost us too much.

There is another problem though. If our consumed capacity drops right down to zero after an extract, which is likely, no consumption data is sent to CloudWatch and Autoscaling doesn't do anything to reduce provisioned capacity, leaving us stuck on a high capacity.

There are two fudge options to fix this though. Firstly, we can set the minimum and maximum throughput provision to the same value. For example, setting the minimum and maximum provisioned RCUs within Autoscaling to 20 will ensure that the provisioned capacity returns to 20, even if there is zero consumed capacity. I'm not sure why, but this works (I've tested it, and it does); AWS acknowledges the workaround here:

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
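A minimal sketch of that configuration with the Node.js AWS SDK (v2) Application Auto Scaling client, assuming the scalable target already exists for the table; the table name and the value of 20 are placeholders:

// Pin the Autoscaling floor and ceiling to the same value so provisioned
// RCUs always return to 20. "DEV_Invoices" and 20 are placeholder values.
const AWS = require('aws-sdk');
const autoscaling = new AWS.ApplicationAutoScaling();

autoscaling.registerScalableTarget({
  ServiceNamespace: 'dynamodb',
  ResourceId: 'table/DEV_Invoices',
  ScalableDimension: 'dynamodb:table:ReadCapacityUnits',
  MinCapacity: 20,
  MaxCapacity: 20,
}).promise()
  .then(() => console.log('Scalable target updated'))
  .catch(console.error);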

The other option is to create a Lambda function that attempts to execute a (failed) read and delete operation on the table every minute. Failed operations still consume capacity, which is why this works. This job ensures data is sent to CloudWatch regularly, even when our 'real' consumption is zero, and therefore Autoscaling will reduce capacity correctly.

Note that read and write data is sent separately to CloudWatch. So if we want WCUs to decrease when real consumed WCUs are zero, we need to use a write operation (i.e. a delete); similarly, we need a read operation to make sure RCUs are updated. Note that reads and deletes on items that do not exist still consume throughput.
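A minimal sketch of that keep-alive Lambda with the Node.js AWS SDK (v2), assuming a placeholder table named Invoices with a partition key called id (schedule it with a CloudWatch Events rule every minute):

// Touch the table once a minute so CloudWatch receives consumed-capacity data
// even when real traffic is zero. "Invoices" and the key "id" are placeholders.
const AWS = require('aws-sdk');
const documentClient = new AWS.DynamoDB.DocumentClient();

exports.handler = async () => {
  const key = { id: 'autoscaling-keep-alive-item-that-never-exists' };

  // Consumes RCUs even though the item does not exist.
  await documentClient.get({ TableName: 'Invoices', Key: key }).promise();

  // Consumes WCUs even though there is nothing to delete.
  await documentClient.delete({ TableName: 'Invoices', Key: key }).promise();
};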

Lambda for throughput decrease

In the previous solution we used a Lambda function to continuously 'poll' the table, thus creating the CloudWatch data which enables DynamoDB Autoscaling to function. As an alternative, we could just have a Lambda which runs regularly and scales down the throughput when required. When you 'describe' a DynamoDB table, you get the current provisioned throughput as well as the last increase and last decrease datetimes. So the Lambda could say: if the provisioned WCUs are over a threshold and the last throughput increase was more than half an hour ago (i.e. we are not in the middle of an extract), decrease the throughput right down.
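A minimal sketch of that scheduled Lambda with the Node.js AWS SDK (v2); the table name, the 25-WCU threshold and the 5/5 target are placeholder values:

// Scale the table back down if WCUs are high and no increase happened recently.
// "Invoices", the 25-WCU threshold and the 5/5 target are placeholders.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

exports.handler = async () => {
  const { Table } = await dynamodb.describeTable({ TableName: 'Invoices' }).promise();
  const { WriteCapacityUnits, LastIncreaseDateTime } = Table.ProvisionedThroughput;

  const thirtyMinutesAgo = Date.now() - 30 * 60 * 1000;
  const lastIncrease = LastIncreaseDateTime ? LastIncreaseDateTime.getTime() : 0;

  if (WriteCapacityUnits > 25 && lastIncrease < thirtyMinutesAgo) {
    await dynamodb.updateTable({
      TableName: 'Invoices',
      ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 },
    }).promise();
  }
};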

Given that this is more code than the Autoscaling option, I'm not inclined to do this one.

Upvotes: 1

Gareth McCumskey

Reputation: 1540

DynamoDB is not designed for that kind of usage. It's not like a traditional DB that you can just query as you wish, and it especially does not do well with fetching large datasets at a time, such as the one you are requesting.

For this type of scenario, I actually use DynamoDB Streams to create a projection in an S3 bucket and then do large exports that way (see the sketch below). It will probably even be faster than the RethinkDB export you reference.
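A minimal sketch of the stream-handler Lambda that keeps such a projection up to date, using the Node.js AWS SDK (v2); the bucket name, the key layout and the partition key name are placeholders, and the stream view type is assumed to include new images:

// Triggered by the table's DynamoDB Stream; mirrors each item into S3 as JSON.
// Assumes the stream view type is NEW_IMAGE or NEW_AND_OLD_IMAGES.
// "my-results-projection" and the key layout are placeholder values.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  for (const record of event.Records) {
    const keys = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.Keys);
    const objectKey = `items/${keys.id}.json`; // assumes a partition key named "id"

    if (record.eventName === 'REMOVE') {
      await s3.deleteObject({ Bucket: 'my-results-projection', Key: objectKey }).promise();
    } else {
      const item = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
      await s3.putObject({
        Bucket: 'my-results-projection',
        Key: objectKey,
        Body: JSON.stringify(item),
        ContentType: 'application/json',
      }).promise();
    }
  }
};

The export then reads the objects from S3 (or a single concatenated dump) instead of scanning the table, so it no longer depends on provisioned read capacity.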

In short, DynamoDB is best as a transactional key-value store for known queries.

Upvotes: 1
