Reputation: 8374
I would like to use AWS Data Pipeline to execute an ETL process. Suppose my process has a small input file and I would like to use a custom jar or Python script to make the data transformations. I don't see any reason to spin up an EMR cluster for such a simple data step, so I would like to execute it on a single EC2 instance.
Looking at the AWS Data Pipeline EmrActivity object, I only see the option to run using an EMR cluster. Is there a way to run a computation step on an EC2 instance? Is that the best solution for this use case, or is it better to set up a small EMR cluster (with a single node) and execute a Hadoop job?
Upvotes: 2
Views: 585
Reputation: 2068
If you don't need the EMR cluster or the Hadoop framework, and your execution can easily run on a single instance, then you can use a ShellCommandActivity associated with an Ec2Resource (an instance) to perform the work. A simple example is at http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-getting-started.html
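A minimal pipeline-definition sketch of that pattern might look like the following. The bucket names, script path, and instance type here are placeholders, and the IAM roles assume the default roles Data Pipeline creates for you:

```json
{
  "objects": [
    {
      "id": "Default",
      "name": "Default",
      "scheduleType": "ondemand",
      "failureAndRerunMode": "CASCADE",
      "role": "DataPipelineDefaultRole",
      "resourceRole": "DataPipelineDefaultResourceRole"
    },
    {
      "id": "MyEc2Instance",
      "name": "MyEc2Instance",
      "type": "Ec2Resource",
      "instanceType": "t1.micro",
      "terminateAfter": "30 Minutes"
    },
    {
      "id": "TransformStep",
      "name": "TransformStep",
      "type": "ShellCommandActivity",
      "runsOn": { "ref": "MyEc2Instance" },
      "command": "python /home/ec2-user/transform.py s3://my-bucket/input.csv s3://my-bucket/output/"
    }
  ]
}
```

The ShellCommandActivity just runs an arbitrary shell command on the instance referenced by `runsOn`, so it can invoke your Python script or a `java -jar` call directly, and `terminateAfter` ensures the instance is shut down when the activity finishes.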
Upvotes: 2