Sudarshan kumar

Reputation: 1585

AWS Data Pipeline vs Lambda for EMR automation

Here are the steps for my application in AWS.

  1. Data will be loaded weekly into 35 separate S3 folders.
  2. On completion of data loading in each of the 35 folders, 35 EMR clusters will be created.
  3. Each EMR cluster will run a Spark-Scala script; the clusters will run in parallel.
  4. On completion of its job, each cluster will be terminated.

How can I achieve this?

As far as I have searched, there are two options:

  1. Invoke an AWS Lambda function on the S3 event; the Lambda will create an EMR cluster and do the spark-submit (see the sketch below this list).
  2. I read about AWS Data Pipeline.
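
For option 1, this is roughly what I imagine the Lambda could look like with boto3 (the bucket, jar path, class name, instance types, and roles are placeholders, not my real setup):

```python
import boto3

emr = boto3.client("emr")

def handler(event, context):
    # The S3 event tells me which of the 35 folders has finished loading.
    key = event["Records"][0]["s3"]["object"]["key"]
    folder = key.split("/")[0]

    # Launch a transient cluster that runs one spark-submit step and
    # terminates itself when the step finishes.
    emr.run_job_flow(
        Name="weekly-" + folder,
        ReleaseLabel="emr-5.29.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m4.xlarge",
            "SlaveInstanceType": "m4.xlarge",
            "InstanceCount": 4,
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "process-" + folder,
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "--class", "com.example.WeeklyJob",    # placeholder class
                    "s3://my-bucket/jars/weekly-job.jar",  # placeholder jar
                    "s3://my-bucket/" + folder + "/",
                ],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
```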

Will AWS Data Pipeline be helpful in my scenario?

Also, I have a Spark-Scala script that I have been running in Zeppelin. If required, I can create a JAR out of it and submit that in Data Pipeline.

Please consider the cost also. I have 5 TB of data to be delivered to the client weekly.

Upvotes: 1

Views: 3934

Answers (1)

abiydv

Reputation: 621

I think you should use Data Pipeline. The pipeline will take care of creating the EMR cluster, submitting the job, and shutting the cluster down once processing is completed. You specify the steps for EMR in the "activity" section, and the "resource" section specifies the parameters of the EMR cluster (like instance type, role to use, etc.).
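
A minimal sketch of what such a definition could look like, pushed with boto3 here (the release label, instance types, roles, and jar path are placeholders you would replace with your own):

```python
import boto3

dp = boto3.client("datapipeline")

pipeline_id = dp.create_pipeline(
    name="weekly-emr-job", uniqueId="weekly-emr-job"
)["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        # Default object: settings shared by the whole pipeline.
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        # "Resource" section: the EMR cluster to create (and tear down).
        {"id": "MyEmrCluster", "name": "MyEmrCluster", "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
            {"key": "releaseLabel", "stringValue": "emr-5.29.0"},
            {"key": "applications", "stringValue": "spark"},
            {"key": "masterInstanceType", "stringValue": "m4.xlarge"},
            {"key": "coreInstanceType", "stringValue": "m4.xlarge"},
            {"key": "coreInstanceCount", "stringValue": "4"},
            {"key": "terminateAfter", "stringValue": "4 Hours"},
        ]},
        # "Activity" section: the EMR step that does the spark-submit.
        {"id": "MyEmrActivity", "name": "MyEmrActivity", "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "MyEmrCluster"},
            {"key": "step", "stringValue":
                "command-runner.jar,spark-submit,--deploy-mode,cluster,"
                "--class,com.example.WeeklyJob,s3://my-bucket/jars/weekly-job.jar"},
        ]},
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```

The `runsOn` reference is what ties the activity to the cluster, so the pipeline knows to spin the cluster up before the step and terminate it afterwards.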

You can even configure an alert to send you an email via SNS if the pipeline fails for some reason, for example:
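
```python
# Objects you could append to the pipelineObjects list above
# (the topic ARN is made up).
alarm_objects = [
    {"id": "FailureAlarm", "name": "FailureAlarm", "fields": [
        {"key": "type", "stringValue": "SnsAlarm"},
        {"key": "topicArn", "stringValue": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"},
        {"key": "subject", "stringValue": "Weekly EMR pipeline failed"},
        {"key": "message", "stringValue": "Pipeline step #{node.name} failed."},
    ]},
]
# ...and on MyEmrActivity add: {"key": "onFail", "refValue": "FailureAlarm"}
```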

Now, coming to how to trigger the pipeline: if the data comes in at predetermined times, you could consider using a "schedule" in the pipeline. The pipeline will then activate at the specified time every day/week/month, for example:
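
```python
# A weekly schedule object (the start time is made up) to add to the
# definition above, replacing the on-demand settings.
schedule_objects = [
    {"id": "WeeklySchedule", "name": "WeeklySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 weeks"},
        {"key": "startDateTime", "stringValue": "2019-01-07T00:00:00"},
    ]},
]
# ...and on the Default object use:
#   {"key": "scheduleType", "stringValue": "cron"},
#   {"key": "schedule", "refValue": "WeeklySchedule"},
```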

Upvotes: 2
