Nate

Reputation: 530

Slower execution of AWS Lambda batch-writes to DynamoDB with multiple threads

Disclaimer: I know this question will annoy some people because it's vague, theoretical, and has little code.

I have an AWS Lambda function in Python which reads a file of denormalized records off S3, formats its contents correctly, and then uploads the result to DynamoDB with a batch write. It all works as advertised. I then tried to break the uploading part of this pipeline into threads, hoping to use DynamoDB's write capacity more efficiently. However, the multithreaded version is about 50% slower. Since the real code is very long, here is a simplified version:

import threading

NUM_THREADS = 4
lines_per_thread = total_lines // NUM_THREADS   # chunk size for one thread

threads = []
batch = []
for line in lines_from_s3:
    batch.append(line)
    if len(batch) >= lines_per_thread:          # enough lines for one thread
        t = threading.Thread(target=upload_lines, args=(batch,))
        t.start()
        threads.append(t)
        batch = []                              # clear list of lines

for t in threads:
    t.join()
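
For reference, the upload step is a DynamoDB batch write. The real function is much longer, but a minimal sketch of what upload_lines does looks like this (table name and item shape are placeholders):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")   # placeholder table name

def upload_lines(lines):
    # batch_writer buffers put_item calls into batch writes of up to
    # 25 items and resends unprocessed items automatically
    with table.batch_writer() as writer:
        for i, line in enumerate(lines):
            # toy item shape; the real code builds the formatted record
            writer.put_item(Item={"id": str(i), "payload": line})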

Important notes and possible sources of the problem I've checked so far:

Again, I know this question is very theoretical, so it's probably hard to see the source of the issue, but is there some Lambda quirk I'm not aware of? Is there something else I can try to help diagnose the issue? Any help is appreciated.

Upvotes: 5

Views: 2350

Answers (3)

Nate

Reputation: 530

Turns out that the threading is faster, but only once the file reaches a certain size. I was originally working with a file of about 1/2 MB. With a 10 MB file, the threaded version came out about 50% faster. Still unsure why it wouldn't work with the smaller file; maybe it just needs time to get a'cooking, you know what I mean? Computers are moody things.

Upvotes: 0

Tim Reichard

Reputation: 1

As a backdrop, I have good experience with Python and DynamoDB, along with Python's multiprocessing library. Since your file size was fairly small, it may have been the setup time of the processes that confused you about performance. If you haven't already, use Python's multiprocessing pools, with map or imap depending on whether you need to communicate any data back to the main process (see the sketch below). Using a pool is the darn simplest way to run multiple processes in Python.

If you need your application to run faster as a priority, you may want to look into golang concurrency; you could always compile that code into a binary and call it from within Python. Cheers.
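
A minimal sketch of the pool approach, with a stand-in upload_batch in place of the real DynamoDB write. (One caveat: Pool itself has historically failed inside Lambda because the runtime lacks /dev/shm, so treat this as the general pattern rather than Lambda-specific code.)

from multiprocessing import Pool

def upload_batch(batch):
    # stand-in for the real work: format the records and batch-write
    # them to DynamoDB, returning a count for the main process
    return len(batch)

if __name__ == "__main__":
    records = ["record-%d" % i for i in range(100)]
    # split into batches of 25, the DynamoDB batch-write item limit
    chunks = [records[i:i + 25] for i in range(0, len(records), 25)]
    with Pool(processes=4) as pool:
        # map blocks until all workers finish and returns results in order
        counts = pool.map(upload_batch, chunks)
    print(sum(counts), "records processed")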

Upvotes: 0

pmueller

Reputation: 323

Nate, have you completely ruled out a problem on the DynamoDB end? The total number of write requests may be the same, but the number per second will be different with multiple threads.

The console has some useful graphs that show whether your writes (or batch writes) are being throttled at all. If you don't have the right 'back off, retry' logic in your Lambda function, Lambda will just try and try again, and your problem gets worse.
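
A minimal sketch of that 'back off, retry' logic around the low-level client call; here request_items is assumed to already be in batch_write_item's request format:

import time
import boto3

dynamodb = boto3.client("dynamodb")

def batch_write_with_backoff(request_items, max_retries=5):
    delay = 0.1
    for _ in range(max_retries):
        response = dynamodb.batch_write_item(RequestItems=request_items)
        # anything DynamoDB throttled comes back as UnprocessedItems
        request_items = response.get("UnprocessedItems")
        if not request_items:
            return
        time.sleep(delay)   # back off before retrying the leftovers
        delay *= 2          # exponential backoff
    raise RuntimeError("items still unprocessed after retries")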

One other thing, which might have been obvious to you (but not me!): I was under the impression that batch writes saved you money on the capacity-planning front, i.e. that 200 writes in batches of 20 would only cost 10 write units. (I could have sworn I heard an AWS guy mention this in a presentation, but that's beside the point.)

In fact, batch writes save you some time, but nothing economically: each item in a batch still consumes its own write units.

One last thought: I'd bet that Lambda processing time is cheaper than raising your DynamoDB write capacity. If you're in no particular rush for the Lambda to finish, why not let it run its course single-threaded?

Good luck!

Upvotes: 1
