lte__

Spark - No FileSystem for scheme: https, cannot load files from Amazon S3

I'm trying to load some data from an Amazon S3 bucket with the following code:

SparkConf sparkConf = new SparkConf().setAppName("Importer");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext sqlContext = new HiveContext(ctx.sc());

DataFrame magento = sqlContext.read().json("https://s3.eu-central-1.amazonaws.com/*/*.json");

The last line, however, throws an error:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: https

The same line has been working in another project. What am I missing? I'm running Spark on a Hortonworks CentOS VM.

Upvotes: 9

Views: 7844

Answers (1)

Piotr Reszke

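By default, Spark supports HDFS, S3 and the local file system. It resolves paths through Hadoop's FileSystem API, and no FileSystem is registered for the https scheme, which is what the exception is telling you. S3 can be accessed through the s3a:// or s3n:// schemes (see "difference between s3a, s3n and s3 protocols").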

So the best way to access a file is to use a path of the following form:

s3a://bucket-name/key
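
As a rough sketch of that rewrite, the snippet below turns a path-style https URL like the one in the question into the s3a:// form. The class name S3aUriSketch and the method toS3a are hypothetical helpers invented for illustration and are not part of Spark or Hadoop. It assumes the bucket is the first path segment of the URL, and the bucket name my-bucket in the comments is a placeholder.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not a Spark or Hadoop API.
final class S3aUriSketch {

    // Converts a path-style S3 URL of the form
    //   https://s3.<region>.amazonaws.com/<bucket>/<key>
    // into s3a://<bucket>/<key>.
    // Assumption: the bucket is the first path segment (path-style addressing).
    static String toS3a(String httpsUrl) {
        URI uri = URI.create(httpsUrl);
        String host = uri.getHost();
        if (host == null || !host.startsWith("s3") || !host.endsWith(".amazonaws.com")) {
            throw new IllegalArgumentException("Not a path-style S3 URL: " + httpsUrl);
        }
        String path = uri.getRawPath(); // e.g. "/my-bucket/data/*.json"
        if (path == null || path.length() < 2) {
            throw new IllegalArgumentException("URL has no bucket: " + httpsUrl);
        }
        // The path already starts with "/", so "s3a:/" + path yields "s3a://<bucket>/<key>".
        return "s3a:/" + path;
    }
}
```

The resulting string could then be passed to sqlContext.read().json(...) in place of the https URL, provided the S3A file system classes and your AWS credentials are available to Spark.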

Depending on your Spark version and the libraries it includes, you may need to add external jars. See:

Spark read file from S3 using sc.textFile ("s3n://...)

(Are you sure that you were using S3 with the https protocol in previous projects? Maybe you had custom code or jars included to support the https protocol?)

Upvotes: 1