Anant Bhandarkar

Reputation: 377

Kinesis Firehose Stream Empty

I am invoking an AWS Lambda function from an EC2 instance multiple times in a loop, passing a subset of a 350 MB dataset on each call. The Lambda transforms the subset it receives and writes the output to a Kinesis Firehose stream, which then delivers it to an S3 bucket. The Firehose stream's S3 buffer size is 50 MB and its buffer interval is 350 seconds, so I get around 7 files of 50 MB each after 6-7 minutes.

Once the Firehose stream has finished writing all the files to S3, I want to trigger another Lambda that combines those files, which contain JSON data, and creates a single CSV file from them.
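To make the combining step concrete, here is a minimal sketch of the JSON-to-CSV conversion. It assumes each S3 object has already been read into a string and that the producer wrote one JSON object per line (Firehose does not add delimiters itself, so that is an assumption about the upstream Lambda). The function name `json_lines_to_csv` is hypothetical.

```python
import csv
import io
import json


def json_lines_to_csv(blobs):
    """Combine newline-delimited JSON blobs into one CSV string.

    `blobs` is an iterable of strings, one per S3 object (assumed
    to contain one JSON object per line).
    """
    rows = []
    fieldnames = []
    for blob in blobs:
        for line in blob.splitlines():
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            rows.append(record)
            # Collect column names in first-seen order.
            for key in record:
                if key not in fieldnames:
                    fieldnames.append(key)

    out = io.StringIO()
    # restval fills columns that a given record does not have.
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Records with different keys end up in the same CSV, with empty cells where a key is missing. Reading the objects from S3 and writing the result back are left out here.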

The challenge is knowing when all the Lambdas have finished and the Kinesis Firehose buffer is empty because everything has been written to S3, so that I can trigger the Lambda that creates the CSV file from all the JSON files in S3.

One option is to wait 350 seconds after the loop finishes, that is, after the last Lambda has been invoked, and then trigger the CSV creation Lambda.

Is there a way to trigger a Lambda once all the Kinesis Firehose stream data has been written, rather than relying on a timer?

Upvotes: 0

Views: 845

Answers (2)

Sachin Tiwari

Reputation: 342

I am not sure about your use case or why you are using Firehose, but if you want to stick with it, it can work under the following conditions:

  1. Increase the buffer size to 350 MB instead of 50 MB.
  2. Increase the buffer interval to 7 minutes.

This way you will get the whole dataset as a single 350 MB file, and you can then trigger the Lambda that converts it to CSV.

You are already waiting 6-7 minutes for the 350 MB of data to be transferred, so performance-wise it is the same to use a 350 MB buffer size and a 7-minute interval.

Upvotes: 0

Perimosh

Reputation: 2824

Your design has some flaws, IMO:

  • Why do you use a Firehose for this? A single EC2 instance can't write fast enough to take advantage of it.
  • Why do you need to split the file? Just save it to S3 from EC2, then have an S3 trigger invoke the Lambda, and process the file in that Lambda to create the CSV file.
  • If you care about order, then Kinesis/Firehose isn't for you.
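For the S3 trigger suggested in the second bullet, a minimal sketch of the Lambda side might look like this. It only extracts the bucket and key from the S3 notification event; the names `s3_objects_from_event` and `handler` are hypothetical, and the actual download and CSV conversion are left as a comment. Object keys arrive URL-encoded in the event, hence the `unquote_plus` call.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return (bucket, key) pairs for each record in an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Keys are URL-encoded in the event (spaces arrive as '+').
        key = unquote_plus(s3["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context):
    """Hypothetical Lambda entry point invoked by the S3 trigger."""
    objects = s3_objects_from_event(event)
    # Here you would fetch each object (e.g. with boto3) and build the CSV.
    return {"objects": [{"bucket": b, "key": k} for b, k in objects]}
```

With this shape, the upload of the file to S3 is itself the "done" signal, so nothing has to guess when a buffer has been flushed.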

As things stand, you can control how the Lambdas are invoked (async vs. sync) and you can have an S3 trigger, but you can't know when Kinesis/Firehose is done. You will have to change your code or design to avoid ending up in a nightmare. You can't just wait X seconds on Kinesis/Firehose: there are many reasons record consumption can be delayed, and any of them will break your design.

Either:

  • Don't split the file, or
  • Don't use Kinesis/Firehose.

Upvotes: 0
