Jake Fund

Reputation: 395

Installing dependencies/libraries for EMR for spark-shell

I am trying to add extra libraries to the Scala used through spark-shell on an Elastic MapReduce instance, but I am unsure how to go about this. Is there a build tool that is used when spark-shell runs?

All I need to do is install a Scala library and have it available in the spark-shell version of Scala. I'm not sure how to go about this, since I don't know how the EMR instance installs Scala and Spark.

Upvotes: 2

Views: 4446

Answers (1)

eliasah

Reputation: 40380

I think that this answer will evolve with the information you give. For now, assuming that you have an AWS EMR cluster deployed on which you wish to use spark-shell, there are several options:

Option 1: You can copy your libraries to the cluster with the scp command and add them to your spark-shell with the --jars option, e.g.:

From your local machine:

scp -i awskey.pem /path/to/jar/lib.jar hadoop@emr-cluster-address:/path/to/destination

On your EMR cluster:

spark-shell --master yarn --jars lib.jar

Spark uses the following URL scheme to allow different strategies for disseminating jars:

  • file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
  • hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
  • local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
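
For example, a jar that is already present at the same path on every node can be referenced with the local: scheme, while one stored on the cluster's HDFS can use hdfs: (the paths below are placeholders):

spark-shell --master yarn --jars local:/usr/lib/extra/lib.jar

spark-shell --master yarn --jars hdfs:///user/hadoop/libs/lib.jar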

Option 2: You can keep a copy of your libraries on S3 and add them with the --jars option.
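
One way to do this (a sketch, with a hypothetical bucket name and paths) is to pull the jar down onto the master node with the AWS CLI and then pass the local copy to --jars:

aws s3 cp s3://my-bucket/libs/lib.jar /home/hadoop/lib.jar

spark-shell --master yarn --jars /home/hadoop/lib.jar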

Option 3: You can use the --packages option to load libraries from a remote repository. You can include any other dependencies by supplying a comma-delimited list of Maven coordinates. All transitive dependencies will be handled when using this option. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the --repositories flag. These options can be used with pyspark, spark-shell, and spark-submit to include Spark Packages.
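
For instance, using the spark-csv package as an illustration (substitute your own library's groupId:artifactId:version; the extra repository URL in the second line is a placeholder):

spark-shell --master yarn --packages com.databricks:spark-csv_2.10:1.5.0

spark-shell --master yarn --packages com.example:my-lib_2.10:0.1.0 --repositories https://my.company.repo/maven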

For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.
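
A sketch of that usage (the file names are placeholders):

spark-submit --master yarn --py-files deps.zip,helpers.py my_job.py

pyspark --py-files deps.zip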

Upvotes: 2
