clay

Reputation: 20450

Spark + Amazon S3 "s3a://" URLs

AFAIK, the newest and best S3 implementation for Hadoop + Spark is the one invoked via the "s3a://" URL scheme. This works great on pre-configured Amazon EMR.
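For reference, the access involved is just a read over an s3a:// path; a minimal sketch (bucket and key are placeholders):

    # Minimal PySpark read over an s3a:// path (bucket/key are placeholders).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-test").getOrCreate()

    df = spark.read.text("s3a://my-bucket/path/to/data.txt")
    df.show()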

However, when running on a local dev system using the pre-built spark-2.0.0-bin-hadoop2.7.tgz, I get:

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    ... 99 more

Next I tried launching my Spark job with the hadoop-aws add-on specified:

$SPARK_HOME/bin/spark-submit --master local \
    --packages org.apache.hadoop:hadoop-aws:2.7.3 \
    my_spark_program.py

I get:

    ::::::::::::::::::::::::::::::::::::::::::::::
    ::              FAILED DOWNLOADS            ::
    :: ^ see resolution messages for details  ^ ::
    ::::::::::::::::::::::::::::::::::::::::::::::
    :: com.google.code.findbugs#jsr305;3.0.0!jsr305.jar
    :: org.apache.avro#avro;1.7.4!avro.jar
    :: org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.jar(bundle)
    ::::::::::::::::::::::::::::::::::::::::::::::

I made a dummy build.sbt project in a temp directory with those three dependencies to see if a basic sbt build could successfully download them. A minimal reconstruction of that build.sbt (project name and Scala version are placeholders):
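    // Dummy build.sbt with just the three failing dependencies
    // (a minimal reconstruction; name and Scala version are placeholders).
    name := "dep-download-test"

    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "com.google.code.findbugs" % "jsr305" % "3.0.0",
      "org.apache.avro" % "avro" % "1.7.4",
      "org.xerial.snappy" % "snappy-java" % "1.0.4.1"
    )

Running sbt update then failed with: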

[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.avro#avro;1.7.4: several problems occurred while resolving dependency: org.apache.avro#avro;1.7.4 {compile=[default(compile)]}:
[error]     org.apache.avro#avro;1.7.4!avro.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.pom
[error]     org.apache.avro#avro;1.7.4!avro.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/apache/avro/avro/1.7.4/avro-1.7.4.pom
[error] 
[error] unresolved dependency: com.google.code.findbugs#jsr305;3.0.0: several problems occurred while resolving dependency: com.google.code.findbugs#jsr305;3.0.0 {compile=[default(compile)]}:
[error]     com.google.code.findbugs#jsr305;3.0.0!jsr305.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
[error]     com.google.code.findbugs#jsr305;3.0.0!jsr305.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
[error] 
[error] unresolved dependency: org.xerial.snappy#snappy-java;1.0.4.1: several problems occurred while resolving dependency: org.xerial.snappy#snappy-java;1.0.4.1 {compile=[default(compile)]}:
[error]     org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.pom
[error]     org.xerial.snappy#snappy-java;1.0.4.1!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/username/.m2/repository/org/xerial/snappy/snappy-java/1.0.4.1/snappy-java-1.0.4.1.pom
[error] Total time: 2 s, completed Sep 2, 2016 6:47:17 PM

Any ideas on how I can get this working?

Upvotes: 4

Views: 2584

Answers (2)

stevel

Reputation: 13480

If you are using Apache Spark (that is, ignoring the build Amazon ships in EMR), you need to add a dependency on org.apache.hadoop:hadoop-aws for exactly the same version of Hadoop the rest of Spark uses. This pulls in the S3A filesystem and its transitive dependencies. The AWS SDK version must match the one hadoop-aws was built against, as the SDK is a bit of a moving target.
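For the spark-2.0.0-bin-hadoop2.7 build in the question, that means staying on the Hadoop 2.7 line for hadoop-aws, together with the AWS SDK that line was built against (1.7.4). A sketch of wiring this up from PySpark, assuming no SparkContext has been started yet:

    # Sketch: match hadoop-aws to the Hadoop version of the Spark build
    # (2.7.x for spark-2.0.0-bin-hadoop2.7) and pin the AWS SDK that
    # hadoop-aws 2.7.x was built against. spark.jars.packages only takes
    # effect if it is set before the SparkContext starts.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3a-local")
             .config("spark.jars.packages",
                     "org.apache.hadoop:hadoop-aws:2.7.3,"
                     "com.amazonaws:aws-java-sdk:1.7.4")
             .getOrCreate())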

See: Apache Spark and Object Stores

Upvotes: 0

Pat

Reputation: 737

It looks like you need additional JARs in your submit command. The Maven repository has a number of AWS packages for Java that you can use to fix your current error: https://mvnrepository.com/search?q=aws

The S3A filesystem error has been a recurring headache for me, but the aws-java-sdk:1.7.4 JAR works with Spark 2.0.

There is further discussion of the issue here; note that the required packages are indeed available from the Maven repository:

https://sparkour.urizone.net/recipes/using-s3/

Try this:

spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 my_spark_program.py
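If the job then fails on authentication rather than a missing class, the standard S3A credential keys can be set on the Hadoop configuration from the program itself; a sketch (the key values and bucket are placeholders, and _jsc is PySpark's internal JVM handle):

    # Supply S3A credentials via the Hadoop configuration (placeholders).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-creds").getOrCreate()
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    df = spark.read.text("s3a://my-bucket/some/key.txt")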

Upvotes: 1
