ClassicThunder

Reputation: 1936

Spark s3a throws a 403 error while the same configuration works for AmazonS3Client

Below are the versions of everything I'm using:

<spark.version>2.3.1</spark.version>
<scala.version>2.11.8</scala.version>
<hadoop.version>2.7.7</hadoop.version>

<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.7.4</version>
</dependency>

And I have the following code, which is submitted to spark-submit as part of a fat jar.

spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("log4j.logger.org.apache.hadoop.fs.s3a", "DEBUG")

spark.sparkContext.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", endpoint)

spark.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", access)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secret)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.session.token", session)

spark.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.host", proxyHost)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.port", proxyPort.toString)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.username", proxyUser)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.password", proxyPass)

val credentials = new StaticCredentialsProvider(new BasicSessionCredentials(access, secret, session))
val config = new ClientConfiguration()
  .withProxyHost(proxyHost)
  .withProxyPort(proxyPort)
  .withProxyUsername(proxyUser)
  .withProxyPassword(proxyPass)
val s3Client = new AmazonS3Client(credentials, config)
s3Client.setEndpoint(endpoint)

val `object` = s3Client.getObject(new GetObjectRequest(bucket, key))
val objectData = `object`.getObjectContent
println("This works! :) " + objectData.toString)

val json = spark.read.textFile("s3a://" + bucket + "/" + key)
println("Error before here :( " + json)

The call using the AmazonS3Client works:

This works! :) com.amazonaws.services.s3.model.S3ObjectInputStream@3f736a16

But I get the error below when using s3a:

2018-09-12 20:45:59 INFO  S3AFileSystem:1207 - Caught an AmazonServiceException com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: D8A113B7B1AB31B9, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: AybHBDYJCeWlw2brLdL0Ezpg5PNTUs9kxUqr17xR6qnv3WTxUQ0T1Vs78aM9mG8bsjTzguePZG0=
2018-09-12 20:45:59 INFO  S3AFileSystem:1208 - Error Message: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: D8A113B7B1AB31B9, AWS Error Code: null, AWS Error Message: Forbidden
2018-09-12 20:45:59 INFO  S3AFileSystem:1209 - HTTP Status Code: 403
2018-09-12 20:45:59 INFO  S3AFileSystem:1210 - AWS Error Code: null
2018-09-12 20:45:59 INFO  S3AFileSystem:1211 - Error Type: Client
2018-09-12 20:45:59 INFO  S3AFileSystem:1212 - Request ID: D8A113B7B1AB31B9
2018-09-12 20:45:59 INFO  S3AFileSystem:1213 - Stack
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: D8A113B7B1AB31B9, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: AybHBDYJCeWlw2brLdL0Ezpg5PNTUs9kxUqr17xR6qnv3WTxUQ0T1Vs78aM9mG8bsjTzguePZG0=
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
    at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1439)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
    at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:732)
    at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:702)
    at com.company.HelloWorld$.main(HelloWorld.scala:77)
    at com.company.HelloWorld.main(HelloWorld.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

As far as I can tell they should be configured identically, so I am at a loss as to why the client works but s3a gets a 403 error.

Upvotes: 3

Views: 4401

Answers (2)

ClassicThunder

Reputation: 1936

I managed to fix the issue by removing the AWS Java SDK dependency

<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.7.4</version>
</dependency>

and replacing it with version 2.8.1 of the hadoop-aws dependency:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>2.8.1</version>
</dependency>
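
My best guess as to why this works: fs.s3a.session.token and TemporaryAWSCredentialsProvider are only recognized by the S3A connector from Hadoop 2.8 onwards (HADOOP-12537), so the 2.7.x connector was presumably signing requests with just the access/secret pair, and temporary credentials presented without their session token are rejected with a 403. Swapping in hadoop-aws 2.8.1 also brings in the AWS SDK artifacts it was actually built against, rather than the hand-pinned 1.7.4.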

Upvotes: 3

stevel

Reputation: 13430

Nothing obvious. That log4j setting needs to go into log4j.properties rather than the Hadoop configuration, but as the authentication chain deliberately avoids logging anything useful, it won't help that much.
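
For reference, a minimal sketch of where that setting belongs, assuming the stock log4j 1.x setup that Spark 2.x ships with:

# in log4j.properties, not the Hadoop configuration:
# enables debug logging for the S3A connector
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG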

  1. The Hadoop S3A troubleshooting docs. These are the accumulated history of all the stack traces seen so far, so check whether yours is already listed.
  2. Try the cloudstore tool, written explicitly to do basic connector debugging and to generate logs that can safely be included in support calls; a sample invocation follows this list.
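
A sample storediag run, assuming you have downloaded a release of the cloudstore jar (the file name here is illustrative):

hadoop jar cloudstore-1.0.jar storediag s3a://mybucket/path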

Upvotes: 1
