S Dub

Reputation: 63

Can't read from S3 bucket with s3 protocol, s3a only

I've been through all the threads on the dependencies needed to connect Spark running on AWS EMR to an S3 bucket, but my issue seems to be slightly different. In all of the other discussions I have seen, the s3 and s3a protocols have the same dependencies, so I'm not sure why one works for me while the other does not. Currently, running Spark in local mode, s3a does the job just fine, but my understanding is that s3 is what's supported on EMR (due to its reliance on HDFS block storage). What am I missing for the s3 protocol to work?

spark.read.format("csv").load("s3a://mybucket/testfile.csv").show()
// this works and displays the DataFrame

versus

spark.read.format("csv").load("s3://mybucket/testfile.csv").show()
/*
java.io.IOException: No FileSystem for scheme: s3
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:355)
  at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  ... 51 elided
*/

Upvotes: 2

Views: 3981

Answers (1)

hagarwal

Reputation: 1163

Apache Hadoop provides the following filesystem clients for reading from and writing to Amazon S3:

  1. S3 (URI scheme: s3) - The original Apache Hadoop implementation of a block-based filesystem backed by S3 (deprecated, and removed in Hadoop 3.x).

  2. S3A (URI scheme: s3a) - S3A uses Amazon's libraries to interact with S3. It supports files larger than 5 GB (up to 5 TB) and provides performance enhancements and other improvements.

  3. S3N (URI scheme: s3n) - A native filesystem for reading and writing regular files on S3. S3N supports objects up to 5 GB in size.
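
This is also why the bare s3 scheme fails outside EMR: on an EMR cluster, s3 (and s3n) are handled by EMRFS, Amazon's own S3 connector, which is not present in a plain local Spark installation, so Hadoop finds no FileSystem registered for the scheme. A minimal sketch of a common local-mode workaround, assuming the hadoop-aws jar that already powers your s3a:// reads is on the classpath, is to point the s3 scheme at the S3A implementation:

// A minimal sketch of a workaround, not EMRFS itself: map the bare "s3"
// scheme onto the S3A client so s3:// paths resolve in local mode.
// Assumes hadoop-aws and a matching aws-java-sdk jar are on the classpath
// (the same dependencies that already make s3a:// work here).
spark.sparkContext.hadoopConfiguration
  .set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

// the failing read from the question should now succeed via S3A
spark.read.format("csv").load("s3://mybucket/testfile.csv").show()

On a real EMR cluster this mapping is unnecessary (and best left alone), since EMRFS already serves the s3 scheme there.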

Reference:

Technically what is the difference between s3n, s3a and s3?

https://web.archive.org/web/20170718025436/https://aws.amazon.com/premiumsupport/knowledge-center/emr-file-system-s3/

Upvotes: 1
