Jeff

Reputation: 91

AWS EMR (4.x-5.x) classpath for custom jar step

When adding a custom jar step to an EMR cluster, how do you set the classpath for a dependent jar (a required library)?

Say my jar file is myjar.jar, but it needs an external jar, dependency.jar, to run. Where do you configure this when creating the cluster? I am using the Advanced Options interface rather than the command line.

I thought I would post this after spending several hours poking around and reading outdated documentation.

The 2.x/3.x documentation describes setting HADOOP_CLASSPATH, but that does not work, and the documentation itself says it does not apply to 4.x and above anyway. It seems you need to specify a --libjars option somewhere, but passing it in the step's arguments list does not work either.

For example:

Step Name: MyCustomStep
Jar Location: s3://somebucket/myjar.jar
Arguments: myclassname option1 option2 --libjars dependentlib.jar

Upvotes: 1

Views: 1221

Answers (1)

Dave Speer

Reputation: 21

Copy your required jars to /usr/lib/hadoop-mapreduce/ in a bootstrap action. No other changes are necessary. Additional info below:

The following command works for me to copy a specific JDBC driver version:

sudo aws s3 cp s3://<your bucket>/mysql-connector-java-5.1.23-bin.jar /usr/lib/hadoop-mapreduce/

I have other dependencies, so I have a bootstrap action for each jar I need copied; you could of course put all the copies in a single bash script. Below is the .NET code I use to create a bootstrap action that runs the copy script. I am using .NET SDK versions 3.3.* and launching the job with release label emr-5.2.0.

public static BootstrapActionConfig CopyEmrJarDependency(string jarName)
{
    return new BootstrapActionConfig()
    {
        Name = $"Copy jars for EMR dependency: {jarName}",
        ScriptBootstrapAction = new ScriptBootstrapActionConfig()
        {
            Path = $"s3n://{Config.AwsS3CodeBucketName}/EMR/Scripts/copy-thirdPartyJar.sh",
            Args = new List<string>()
                {
                    $"s3://{Config.AwsS3CodeBucketName}/EMR/Java/lib/{jarName}",
                    "/usr/lib/hadoop-mapreduce/"
                }
        }
    };
}

Note that the ScriptBootstrapActionConfig Path property uses the "s3n://" scheme, whereas the path passed to the aws s3 cp command should use "s3://".

My script copy-thirdPartyJar.sh contains the following:

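#!/bin/bash
# $1 = S3 location of the jar
# $2 = destination directory (the attempted "magic" directory for the Java classpath)
# Quote both arguments so paths containing spaces are passed intact.
sudo aws s3 cp "$1" "$2"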
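
If you would rather use the single-script approach mentioned earlier, a minimal sketch might look like the following. The function name copy_jars, the DEST variable, and the argument layout (an S3 prefix followed by jar names) are my own assumptions for illustration, not part of the setup described above; the copy command itself is the same one used in the one-jar script.

```shell
#!/bin/bash
# Sketch only: copy several dependency jars in one bootstrap action.
# Assumed usage: <this script> <s3 prefix> <jar name> [<jar name> ...]
# DEST is a hypothetical override; it defaults to the directory used above.
DEST="${DEST:-/usr/lib/hadoop-mapreduce/}"

copy_jars() {
    local prefix="$1"
    shift
    local jar
    for jar in "$@"; do
        # Same command as the one-jar script, run once per jar.
        # "${prefix%/}" drops a trailing slash so the path has exactly one.
        sudo aws s3 cp "${prefix%/}/${jar}" "$DEST" || return 1
    done
}

# Only run when arguments were supplied.
if [ "$#" -gt 0 ]; then
    copy_jars "$@"
fi
```

With this layout the bootstrap action's arguments would be the S3 prefix followed by each jar name, so one bootstrap action replaces the per-jar actions. The function stops at the first failed copy and returns a non-zero status.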

Upvotes: 2
