Reputation: 753
I'm struggling with a setup for the following use case. I have possibly millions of files in an S3 bucket, divided into days. I want to put all the data from a certain period into Timestream for time-based queries. Unfortunately, I noticed that single-threaded processing on EC2, where I simply iterate through the files and send them in batches to Timestream, doesn't work well: it takes around 24h to ingest a single day.

So what I tried as an alternative was AWS Lambda processing. I created a temp bucket into which I synced a single day of data from the main bucket, and each file triggers my Lambda via an S3 notification. This is pretty neat and scales to unattainable sizes, BUT! The default concurrency quota for AWS Lambda is 1000. I'd be fine if new incoming events were queued, but they are simply discarded. On top of that, each (.orc) file contains up to 90k records, and I noticed that the Timestream boto3 client is rather slow: it takes around 100-150ms on average to save a batch of 100 records. So you do the math... each Lambda execution takes up to 3 min! And on top of that(!) I also noticed that some saves take more than a second (I assume Timestream-side throttling or something), so some of the Lambdas timed out after 3 min. In the end I managed to get around 1/3 to 1/2 of the daily data in a single run.
But it was quick... So what I'm trying to achieve now is a more sustainable way of ingesting this data. Kinesis allows a parallelisation factor of up to 10 per shard (so throughput depends on the number of shards) - not great. I'd like to always have around 100-500 Lambdas running. So I need a way of queueing the S3 notifications and consuming them at a pace of a couple hundred at once. Also, maybe Timestream should perform better and I'm doing something wrong? My initialisation code:
import boto3
from botocore.config import Config

timestream = boto3.client(
    'timestream-write',
    config=Config(read_timeout=5, max_pool_connections=5000, retries={'max_attempts': 10})
)
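Since WriteRecords caps out at 100 records per call, the batching logic is worth isolating so it can be reused from EC2 or Lambda alike. A minimal sketch (the function names and record layout here are illustrative, not taken from the question's code; `CommonAttributes` is a real WriteRecords parameter that lets you factor shared dimensions out of every record to shrink the payload):

```python
def batch(records, size=100):
    """Split records into chunks of at most `size`
    (Timestream's per-WriteRecords limit is 100)."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def send_batches(client, database, table, records, common_attrs=None):
    """Write all records in batches of 100. `client` is a boto3
    'timestream-write' client; shared dimensions/measure metadata can be
    moved into CommonAttributes instead of repeating them per record."""
    for chunk in batch(records):
        kwargs = {'DatabaseName': database, 'TableName': table, 'Records': chunk}
        if common_attrs:
            kwargs['CommonAttributes'] = common_attrs
        client.write_records(**kwargs)
```

With ~90k records per file that is still ~900 sequential calls at 100-150ms each, so parallelising the calls (threads or batching across workers) matters more than the client config itself.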
Oh, and on a side note, I noticed something strange about Timestream yesterday. When I triggered processing of the same file over and over again, it didn't reject the records; instead it silently ignored them and responded with 200. Weirdest stuff.
Anyway, any help appreciated as I'm out of ideas.
Upvotes: 0
Views: 836
Reputation: 1
Trigger SQS with every new file you get in S3; SQS triggers Lambda, and Lambda processes the files sent to it as you see fit.

How does SQS work? SQS sends the event JSON to Lambda. The message is not removed from the queue; it becomes invisible for the duration of the visibility timeout. When Lambda finishes successfully, the message (the file that triggered SQS) is removed from the queue; if Lambda fails, SQS makes the message visible again and it is redelivered to Lambda. Here is more about SQS visibility: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
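With S3 → SQS → Lambda, the handler receives a batch of SQS messages whose bodies are the original S3 notifications. A minimal handler sketch (assuming "Report batch item failures" is enabled on the event source mapping, so only the failed messages are retried after the visibility timeout; `process_file` is a hypothetical placeholder for the actual ORC → Timestream ingestion):

```python
import json

def handler(event, context, process_file=lambda bucket, key: None):
    """Process SQS messages that wrap S3 event notifications.
    Returns partial batch failures so successful messages are deleted
    and only failed ones become visible again for redelivery."""
    failures = []
    for msg in event['Records']:
        try:
            body = json.loads(msg['body'])  # the embedded S3 notification
            for rec in body.get('Records', []):
                bucket = rec['s3']['bucket']['name']
                key = rec['s3']['object']['key']
                process_file(bucket, key)
        except Exception:
            failures.append({'itemIdentifier': msg['messageId']})
    return {'batchItemFailures': failures}
```

The concurrency question from the post is handled on the event source mapping side: SQS buffers the notifications instead of discarding them, and you can cap how many Lambdas run at once (maximum concurrency on the mapping, or reserved concurrency on the function) to stay in the 100-500 range.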
Upvotes: 0