Reputation: 1585
Here is the flow of my application in AWS: new data lands in S3 folders, a Spark job processes it on EMR, and the output is delivered to a client. How can I achieve this?
As far as I have searched, there are two options: an AWS Lambda function triggered on the S3 event, where the Lambda creates an EMR cluster and does the spark-submit, or AWS Data Pipeline. Would Data Pipeline be helpful in my scenario?
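For the Lambda option, what I have in mind is roughly the following (a Python/boto3 sketch; the class name, jar path, bucket names, and instance settings are just placeholders):

    import boto3

    emr = boto3.client("emr")

    def lambda_handler(event, context):
        # Triggered by the S3 ObjectCreated event configured on the bucket.
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Spin up a transient EMR cluster that runs one spark-submit step
        # and terminates itself when the step finishes.
        response = emr.run_job_flow(
            Name="weekly-spark-job",
            ReleaseLabel="emr-5.13.0",
            Instances={
                "MasterInstanceType": "m4.xlarge",
                "SlaveInstanceType": "m4.xlarge",
                "InstanceCount": 3,
                "KeepJobFlowAliveWhenNoSteps": False,  # shut down after the step
            },
            Steps=[{
                "Name": "spark-submit",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--class", "com.example.Main",     # placeholder class
                        "s3://my-code-bucket/my-job.jar",  # placeholder jar path
                        "s3://{}/{}".format(bucket, key),  # the object that fired the event
                    ],
                },
            }],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
        return response["JobFlowId"]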
I also have a Spark/Scala script that I have been running in Zeppelin. If required, I can build a jar out of it and submit that through Data Pipeline.
Please consider cost as well: I have 5 TB of data to be delivered to the client weekly.
Upvotes: 1
Views: 3934
Reputation: 621
I think you should use Data Pipeline. The pipeline will take care of creating the EMR cluster, submitting the job, and shutting the cluster down once processing is complete. You can specify the steps for EMR in the "activity" section, and the "resource" section can specify the parameters of the EMR cluster (such as instance type, role to use, etc.).
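As a rough sketch of what that could look like through boto3 (the pipeline name, roles, S3 paths, jar, class, and instance settings below are placeholders, not a definitive setup):

    import boto3

    dp = boto3.client("datapipeline")

    pipeline_id = dp.create_pipeline(
        name="weekly-spark-delivery", uniqueId="weekly-spark-delivery-1"
    )["pipelineId"]

    dp.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=[
            # Pipeline-wide defaults (roles, log location).
            {"id": "Default", "name": "Default", "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
                {"key": "pipelineLogUri", "stringValue": "s3://my-log-bucket/logs/"},
            ]},
            # "Resource" section: the transient EMR cluster.
            {"id": "MyEmrCluster", "name": "MyEmrCluster", "fields": [
                {"key": "type", "stringValue": "EmrCluster"},
                {"key": "releaseLabel", "stringValue": "emr-5.13.0"},
                {"key": "masterInstanceType", "stringValue": "m4.xlarge"},
                {"key": "coreInstanceType", "stringValue": "m4.xlarge"},
                {"key": "coreInstanceCount", "stringValue": "2"},
                {"key": "terminateAfter", "stringValue": "6 Hours"},
            ]},
            # "Activity" section: the EMR step; step args are comma-separated.
            {"id": "MySparkActivity", "name": "MySparkActivity", "fields": [
                {"key": "type", "stringValue": "EmrActivity"},
                {"key": "runsOn", "refValue": "MyEmrCluster"},
                {"key": "step", "stringValue":
                    "command-runner.jar,spark-submit,--class,com.example.Main,"
                    "s3://my-code-bucket/my-job.jar"},
            ]},
        ],
    )

    dp.activate_pipeline(pipelineId=pipeline_id)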
You can even configure an alert to send you an email via SNS if the pipeline fails for some reason.
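Continuing the sketch above, one way to wire that up is an SnsAlarm object referenced from the activity's onFail field (the topic ARN is a placeholder):

    # Extra pipeline object to add to the definition above.
    failure_alarm = {"id": "FailureAlarm", "name": "FailureAlarm", "fields": [
        {"key": "type", "stringValue": "SnsAlarm"},
        {"key": "topicArn", "stringValue":
            "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"},
        {"key": "subject", "stringValue": "Data Pipeline failure"},
        {"key": "message", "stringValue": "The weekly Spark pipeline failed."},
    ]}

    # Add this field to the EmrActivity so a failed run fires the alarm:
    on_fail = {"key": "onFail", "refValue": "FailureAlarm"}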
Now, coming to how to trigger the pipeline: if the data comes in at predetermined times, you could consider using a "schedule" in the pipeline (see the sketch below). The pipeline will then activate at the specified time every day/week/month.
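For a weekly run, that could be a Schedule object referenced from the Default object, roughly like this (the start time is a placeholder):

    # A Schedule object to add to the definition above.
    weekly_schedule = {"id": "WeeklySchedule", "name": "WeeklySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 weeks"},
        {"key": "startDateTime", "stringValue": "2018-06-04T00:00:00"},
    ]}

    # And in the Default object, switch from on-demand to scheduled runs:
    #   {"key": "scheduleType", "stringValue": "cron"},
    #   {"key": "schedule", "refValue": "WeeklySchedule"},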
Upvotes: 2