
Reputation: 3517

Hadoop can list S3 contents but spark-shell throws ClassNotFoundException

My saga continues -

In short, I'm trying to create a test stack for Spark - the aim being to read a file from an S3 bucket and then write it to another. Windows environment.

I was repeatedly hitting a ClassNotFoundException when trying to access s3 or s3n, so I added the implementation classes to core-site.xml as fs.s3.impl and fs.s3n.impl.

I added hadoop/share/tools/lib to the classpath to no avail. I then copied the aws-java-sdk and hadoop-aws jars into the share/hadoop/common folder, and I am now able to list the contents of a bucket using Hadoop on the command line.

hadoop fs -ls "s3n://bucket" shows me the contents; this is great news :)
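The one sanity check I know of (just a sketch - as I understand it, --glob expands the wildcard entries and findstr is only there to filter the output) is to confirm the jars really are visible on the Hadoop classpath:

    hadoop classpath --glob | findstr hadoop-aws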

In my mind the Hadoop configuration should be picked up by Spark, so solving one should solve the other. However, when I run spark-shell and try to save a file to S3, I get the usual ClassNotFoundException shown below.
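For reference, the spark-shell code itself is nothing special; a minimal sketch of the kind of thing I'm running (bucket names are placeholders) is:

    // read from one bucket, write to another - placeholders, not real bucket names
    val data = sc.textFile("s3n://source-bucket/input.txt")
    data.saveAsTextFile("s3n://target-bucket/output")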

I'm still quite new to this and unsure whether I've missed something obvious; hopefully someone can help me solve the riddle. Any help is greatly appreciated, thanks.

The exception:

    java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
            at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
            at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
            at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)

My core-site.xml (which I believe to be correct now, as Hadoop can access S3):

    <property>
        <name>fs.s3.impl</name>
        <value>org.apache.hadoop.fs.s3.S3FileSystem</value>
    </property>

    <property>
        <name>fs.s3n.impl</name>
        <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
        <description>The FileSystem for s3n: (Native S3) uris.</description>
    </property>

And finally the hadoop-env.cmd showing the classpath (which is seemingly ignored):

    set HADOOP_CONF_DIR=C:\Spark\hadoop\etc\hadoop

    @rem ## added as s3 filesystem not found. http://stackoverflow.com/questions/28029134/how-can-i-access-s3-s3n-from-a-local-hadoop-2-6-installation
    set HADOOP_USER_CLASSPATH_FIRST=true
    set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%:%HADOOP_HOME%\share\hadoop\tools\lib\*

    @rem Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
    if exist %HADOOP_HOME%\contrib\capacity-scheduler (
      if not defined HADOOP_CLASSPATH (
        set HADOOP_CLASSPATH=%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
      ) else (
        set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%;%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
      )
    )

EDIT: spark-defaults.conf

    spark.driver.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
    spark.executor.extraClassPath=C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar:C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar
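For what it's worth, as I understand it an equivalent way to hand these same jars to spark-shell at launch (again just a sketch, reusing the paths above) would be the --jars flag:

    spark-shell --jars C:\Spark\hadoop\share\hadoop\common\lib\hadoop-aws-2.7.1.jar,C:\Spark\hadoop\share\hadoop\common\lib\aws-java-sdk-1.7.4.jar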

Upvotes: 0

Views: 2154

Answers (1)

avloss

Reputation: 2636

You need to pass some parameters to your spark-shell. Try this flag: --packages org.apache.hadoop:hadoop-aws:2.7.2
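For example, a minimal invocation would look something like this (the package should also pull in the matching AWS SDK jar as a transitive dependency):

    spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.2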

Upvotes: 2
