Philip Philippou

Reputation: 43

AWS EMR Multiple Jobs Dependency Contention

Problem

I am attempting to run two PySpark steps in EMR, both reading from Kinesis using KinesisUtils. This requires the dependent library spark-streaming-kinesis-asl_2.11.

I'm using Terraform to stand up the EMR cluster and to invoke both steps with the args:

--packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5

There appears to be contention on startup, with both steps downloading the jar from Maven at the same time and causing a checksum failure.
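For context, each step effectively runs something along these lines (the script locations below are placeholders, not the real paths), so both steps hit the same shared Ivy cache at the same time:

    spark-submit \
      --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5 \
      s3://my-bucket/jobs/step_one.py

    spark-submit \
      --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5 \
      s3://my-bucket/jobs/step_two.py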

Things attempted

  1. I've tried to move the download of the jar to the bootstrap bash script using:

sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5

This causes problems because spark-shell is only available on the master node, while bootstrap actions run on all nodes.

  2. I've tried to limit the above to run only on the master node using

grep -q '"isMaster":true' /mnt/var/lib/info/instance.json || { echo "Not running on master node, nothing further to do" && exit 0; }

That didn't seem to work.
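For reference, the bootstrap script I was aiming for combined the two attempts above, roughly as follows (a sketch of the intent, not a working script):

    #!/bin/bash
    # Exit early on core/task nodes; only the master node should continue.
    grep -q '"isMaster":true' /mnt/var/lib/info/instance.json \
      || { echo "Not running on master node, nothing further to do" && exit 0; }

    # Pre-fetch the Kinesis ASL package into the Ivy cache so the steps
    # can later be submitted without --packages.
    sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5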

  3. I've attempted to add Spark configuration to do this in the EMR configuration.json

    {
      "Classification": "spark-defaults",
      "Properties": {
        "spark.jars.packages": "org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5"
      }
    }

This also didn't work, and seemed to stop any jars being copied to the master node directory

/home/hadoop/.ivy2/cache

What does work manually is logging onto the master node and running

sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5

Then submitting the jobs manually without the --packages option.
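In other words, roughly:

    # On the master node: warm the Ivy cache once.
    sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5

    # Then submit each job without --packages (script location is a placeholder).
    spark-submit s3://my-bucket/jobs/step_one.py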

Currently, all I need to do is manually restart the failed jobs separately (by cloning the steps in the AWS console) and everything runs fine.

I just want to be able to start the cluster with all steps starting successfully; any help would be greatly appreciated.

Upvotes: 4

Views: 717

Answers (1)

srikanth holur

Reputation: 780

  1. Download the required jars and upload them to S3 (a one-time task).
  2. When running your PySpark jobs from a step, pass --jars <s3 location of jar> to your spark-submit, as sketched below.
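A minimal sketch of that approach, assuming a placeholder bucket name and the coordinates from the question (the ASL package also has transitive dependencies, so more jars may be needed):

    # One time: fetch the jar from Maven Central and upload it to S3.
    wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kinesis-asl_2.11/2.4.5/spark-streaming-kinesis-asl_2.11-2.4.5.jar
    aws s3 cp spark-streaming-kinesis-asl_2.11-2.4.5.jar s3://my-bucket/jars/

    # In each EMR step: point spark-submit at the jar in S3 instead of using --packages.
    spark-submit \
      --jars s3://my-bucket/jars/spark-streaming-kinesis-asl_2.11-2.4.5.jar \
      s3://my-bucket/jobs/step_one.py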

Upvotes: 3
