ffriend

Reputation: 28492

Write to S3 from Spark without access and secret keys

Our EC2 server is configured to allow access to my-bucket when using DefaultAWSCredentialsProviderChain, so the following code using plain AWS SDK works fine:

AmazonS3 s3client = new AmazonS3Client(new DefaultAWSCredentialsProviderChain());
s3client.putObject(new PutObjectRequest("my-bucket", "my-object", "/path/to/my-file.txt"));

Spark's S3AOutputStream uses the same SDK internally; however, trying to upload a file without providing access and secret keys doesn't work:

sc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
// not setting access and secret key
JavaRDD<String> rdd = sc.parallelize(Arrays.asList("hello", "stackoverflow"));
rdd.saveAsTextFile("s3a://my-bucket/my-file-txt");

gives:

Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: 25DF243A166206A0, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: Ki5SP11xQEMKb0m0UZNXb4FhfWLMdbehbknQ+jeZuO/wjhwurjkFoEYVfrQfW1KIq435Lo9jPkw=  
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)  
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)  
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)  
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)  
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:130)
    <truncated>

Is there a way to force Spark to use default credential provider chain instead of relying on access and secret key?

Upvotes: 1

Views: 1506

Answers (1)

stevel

Reputation: 13430

Technically, that's Hadoop's s3a output stream, not Spark's. Look at the stack trace to see who to file bug reports against :)

And s3a does support Instance Credentials from Hadoop 2.7+ (see the S3AFileSystem source for proof).
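
If you want to be explicit about it rather than relying on the fallback order, newer Hadoop versions let you pin the provider in the Hadoop configuration. The sketch below assumes Hadoop 2.8+, where the fs.s3a.aws.credentials.provider property exists; on 2.7 it is ignored and the instance-profile provider is simply tried automatically when no keys are configured.

// Sketch for Hadoop 2.8+ only: pin s3a to the EC2 instance-profile provider
// instead of letting it look for access/secret keys first.
sc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
sc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider");

JavaRDD<String> rdd = sc.parallelize(Arrays.asList("hello", "stackoverflow"));
rdd.saveAsTextFile("s3a://my-bucket/my-file-txt");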

If you can't connect, you need to have the Hadoop 2.7 JARs on your classpath, along with the exact version of the AWS SDK they were built against (1.7.4, as I recall).
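
A quick way to check that from the driver is a reflective load of the classes the stack trace already mentions; just a sketch, it only proves the JARs are on the classpath, not that the versions match.

// Sketch: verify that the s3a filesystem and the AWS SDK classes it needs
// can actually be loaded before submitting real work.
try {
    Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem");
    Class.forName("com.amazonaws.auth.DefaultAWSCredentialsProviderChain");
    System.out.println("hadoop-aws and aws-java-sdk are both on the classpath");
} catch (ClassNotFoundException e) {
    System.out.println("missing JAR: " + e.getMessage());
}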

Spark has one little feature: if you submit work with the AWS_* environment variables set, it picks them up, copies them in as the fs.s3a keys, and so propagates them to your systems.
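
For illustration only, this is roughly what that propagation amounts to; the property names are the standard fs.s3a credential keys, and you don't need the snippet if you let Spark do it for you.

// Illustration of the env-var propagation described above: Spark effectively
// copies AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the submitting shell
// into the Hadoop configuration as the fs.s3a credential keys.
String accessKey = System.getenv("AWS_ACCESS_KEY_ID");
String secretKey = System.getenv("AWS_SECRET_ACCESS_KEY");
if (accessKey != null && secretKey != null) {
    sc.hadoopConfiguration().set("fs.s3a.access.key", accessKey);
    sc.hadoopConfiguration().set("fs.s3a.secret.key", secretKey);
}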

Upvotes: 1
