hari

Reputation: 31

How to install sqoop in Amazon EMR?

I've created a cluster in Amazon EMR using emr-4.0.0, with Hadoop distribution Amazon 2.6.0 and Hive 1.0.0. I need to install Sqoop so that I can transfer data between Hive and Redshift. What are the steps to install Sqoop on an EMR cluster? Thank you!

Upvotes: 1

Views: 3118

Answers (3)

Sayat Satybald

Reputation: 6580

Note that starting with emr-4.4.0, AWS added support for Sqoop 1.4.6 to EMR clusters. Installation takes a couple of clicks during cluster setup; no manual installation is needed.
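On those newer release labels, Sqoop can also be requested as an application from the AWS CLI. A sketch, with placeholder instance settings and key name (note the application may be listed as Sqoop-Sandbox on some 4.x releases):

```shell
# Sketch: launch an EMR cluster with Sqoop preinstalled (emr-4.4.0+).
# Instance type/count and key name below are placeholders.
aws emr create-cluster \
    --name "cluster-with-sqoop" \
    --release-label emr-4.4.0 \
    --applications Name=Hadoop Name=Hive Name=Sqoop \
    --instance-type m3.xlarge --instance-count 3 \
    --ec2-attributes KeyName=your-key-pair \
    --use-default-roles
```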

Upvotes: 1

Ana Todor

Reputation: 801

Note that in EMR 4.0.0 hadoop fs -copyToLocal will throw errors.

Use aws s3 cp instead.

To be more specific than Amal:

  1. Download the latest version of Sqoop and upload it to an S3 location. I am using sqoop-1.4.4.bin__hadoop-2.0.4-alpha and it seems to work just fine with EMR 4.0.0.
  2. Download the JAR connector for Redshift and upload it to the same S3 location. This page might help.
  3. Upload a script similar to the one below to S3

    #!/bin/bash
    # Install Sqoop and the Redshift JDBC connector. Store in S3
    # and load as a bootstrap step.
    
    bucket_location='s3://your-sqoop-jars-location/'
    sqoop_jar='sqoop-1.4.4.bin__hadoop-2.0.4-alpha'
    sqoop_jar_gz=$sqoop_jar.tar.gz
    redshift_jar='RedshiftJDBC41-1.1.7.1007.jar'
    
    cd /home/hadoop
    
    aws s3 cp $bucket_location$sqoop_jar_gz .
    tar -xzf $sqoop_jar_gz
    aws s3 cp $bucket_location$redshift_jar .
    cp $redshift_jar $sqoop_jar/lib/
    
  4. Set SQOOP_HOME and add it to the PATH so that sqoop can be called from anywhere. These entries should be made in /etc/bashrc; otherwise you will have to use the full path, in this case /home/hadoop/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/bin/sqoop
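For step 4, the bootstrap script itself could append the entries. A sketch, assuming the tarball was extracted to the same path used in the script above:

```shell
# Sketch: make sqoop callable from anywhere by appending the
# environment entries to /etc/bashrc. The Sqoop directory name
# matches the tarball extracted by the bootstrap script above.
sudo tee -a /etc/bashrc > /dev/null <<'EOF'
export SQOOP_HOME=/home/hadoop/sqoop-1.4.4.bin__hadoop-2.0.4-alpha
export PATH=$PATH:$SQOOP_HOME/bin
EOF
```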

I am using Java to programmatically launch my EMR cluster. To configure bootstrap steps in Java I create a BootstrapActionConfigFactory:

public final class BootstrapActionConfigFactory {
    private static final String bucket = Config.getBootstrapBucket();

    // make class non-instantiable
    private BootstrapActionConfigFactory() {
    }

    /**
     * Adds an install Sqoop step to the job that corresponds to the version set in the Config class.
     */
    public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig() {
        return newInstallSqoopBootstrapActionConfig(Config.getHadoopVersion().charAt(0));
    }

    /**
     * Adds an install Sqoop step to the job that corresponds to the version specified in the parameter
     *
     * @param hadoopVersion the main version number for Hadoop. E.g.: 1, 2
     */
    public static BootstrapActionConfig newInstallSqoopBootstrapActionConfig(char hadoopVersion) {
        return new BootstrapActionConfig().withName("Install Sqoop")
            .withScriptBootstrapAction(
                new ScriptBootstrapActionConfig().withPath("s3://" + bucket + "/sqoop-tools/hadoop" + hadoopVersion + "/bootstrap-sqoop-emr4.sh"));
    }
}

Then when creating the job:

Job job = new Job(Region.getRegion(Regions.US_EAST_1));
job.addBootstrapAction(BootstrapActionConfigFactory.newInstallSqoopBootstrapActionConfig());
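Once the cluster is up and the bootstrap action has run, exporting a Hive table's HDFS directory to Redshift looks roughly like this. The endpoint, database, table, and credentials are placeholders, and the driver class is assumed to match the JDBC41 jar used above:

```shell
# Sketch: export a Hive table's warehouse directory to Redshift.
# Endpoint, database, table and credentials are placeholders; the
# field delimiter assumes Hive's default (Ctrl-A, \001).
sqoop export \
    --connect jdbc:redshift://your-cluster.us-east-1.redshift.amazonaws.com:5439/yourdb \
    --driver com.amazon.redshift.jdbc41.Driver \
    --username your_user \
    --password your_password \
    --table your_table \
    --export-dir /user/hive/warehouse/your_table \
    --input-fields-terminated-by '\001'
```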

Upvotes: 7

Amal G Jose

Reputation: 2546

Download the Sqoop tarball and keep it in an S3 bucket. Create a bootstrap script that performs the following activities:

  1. Download the Sqoop tarball to the required instances
  2. Extract the tarball
  3. Set SQOOP_HOME and add SQOOP_HOME to the PATH. These entries should be made in /etc/bashrc
  4. Add the required connector jars to Sqoop's lib directory

Keep this script in S3 and point to it in the cluster's bootstrap actions.
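Registering the script as a bootstrap action can also be done from the AWS CLI rather than the Java SDK. A sketch, with placeholder bucket, script name, and instance settings:

```shell
# Sketch: launch an EMR cluster that runs the install script from S3
# as a bootstrap action. Bucket, script and key names are placeholders.
aws emr create-cluster \
    --name "cluster-with-sqoop-bootstrap" \
    --release-label emr-4.0.0 \
    --applications Name=Hadoop Name=Hive \
    --instance-type m3.xlarge --instance-count 3 \
    --bootstrap-actions Path=s3://your-bucket/install-sqoop.sh,Name="Install Sqoop" \
    --ec2-attributes KeyName=your-key-pair \
    --use-default-roles
```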

Upvotes: 1
