Mark J Miller

Reputation: 4871

How do you run an EmrActivity on an existing EMR cluster?

Is there a way to run an EmrActivity in AWS Data Pipeline on an existing cluster? We currently use Data Pipeline to run jobs in AWS EMR using EmrCluster and EmrActivity, but we'd like all pipelines to run on the same cluster. I've read the documentation and tried building a pipeline in Architect, but the only option I can find is to create a new cluster and run jobs on it; there doesn't seem to be a way to define a pipeline that uses an existing cluster. If there is, how would I do it? We create our pipelines with CloudFormation, so a CloudFormation example would be preferable, but I'll take what I can get.

Upvotes: 2

Views: 1845

Answers (1)

enisher

Reputation: 309

Yes, it is possible.

  1. Launch your EMR cluster.
  2. Start Task Runner on the master instance with the option --workerGroup=name-of-the-worker-group.
  3. In your pipeline's activities, omit the runsOn field and set workerGroup to the name of your worker group instead.

Here is an example of an activity (a RedshiftCopyActivity in this case) that sets workerGroup, as defined in CloudFormation:

...
{
        "Id": "S3ToRedshiftCopyActivity",
        "Name": "S3ToRedshiftCopyActivity",
        "Fields": [
          {
            "Key": "type",
            "StringValue": "RedshiftCopyActivity"
          },
          {
            "Key": "workerGroup",
            "StringValue": "name-of-the-worker-group"
          },
          {
            "Key": "insertMode",
            "StringValue": "#{myInsertMode}"
          },
          {
            "Key": "commandOptions",
            "StringValue": "FORMAT CSV"
          },
          {
            "Key": "dependsOn",
            "RefValue": "RedshiftTableCreateActivity"
          },
          {
            "Key": "input",
            "RefValue": "S3StagingDataNode"
          },
          {
            "Key": "output",
            "RefValue": "DestRedshiftTable"
          }
        ]
}
...
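
Since the question is about EmrActivity specifically, here is a rough, untested sketch of how the same idea might look for that activity type. It assumes EmrActivity accepts workerGroup in place of runsOn (the Data Pipeline object reference lists both fields), and that the cluster provides command-runner.jar. The Id, Name, and S3 path in the step value are hypothetical placeholders, not values from a real pipeline:

```json
{
  "Id": "ExistingClusterEmrActivity",
  "Name": "ExistingClusterEmrActivity",
  "Fields": [
    {
      "Key": "type",
      "StringValue": "EmrActivity"
    },
    {
      "Key": "workerGroup",
      "StringValue": "name-of-the-worker-group"
    },
    {
      "Key": "step",
      "StringValue": "command-runner.jar,spark-submit,s3://my-bucket/jobs/my_job.py"
    }
  ]
}
```

Note that there is no runsOn field and no EmrCluster object in the pipeline; the workerGroup value has to match the name passed to Task Runner on the master instance.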

Detailed documentation on how to do this is here: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html

Upvotes: 4
