Reputation: 1734
While using Spark 2.3.0 with hadoop-aws 2.7.6, I tried to read a file from S3:
spark.sparkContext.textFile("s3a://ap-northeast-2-bucket/file-1").take(10)
But an AmazonS3Exception was raised:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 202ABEDF0E955321, AWS Error Code: null, AWS Error Message: Bad Request
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
...
I launched the EC2 instance with an instance profile, so the AWS SDK picks up credentials via InstanceProfileCredentialsProvider, and from a shell on the instance I can use the AWS CLI successfully:
aws s3 ls ap-northeast-2-bucket
aws s3 cp s3://ap-northeast-2-bucket/file-a file-a
I did set fs.s3a.endpoint to s3.ap-northeast-2.amazonaws.com in spark-defaults.conf:
# spark-defaults.conf
spark.hadoop.fs.s3a.endpoint s3.ap-northeast-2.amazonaws.com
Upvotes: 1
Views: 1506
Reputation: 1734
This was caused by a combination of several factors.
I was using Spark 2.3.0 built for Hadoop 2.7, so I was using hadoop-aws 2.7.6, which transitively pulls in aws-java-sdk 1.7.4.
My bucket is located in Seoul (ap-northeast-2), and the Seoul and Frankfurt regions support only the V4 signing mechanism. So the SDK must be pointed at the region-specific endpoint for it to sign requests properly. This can be fixed by setting the Hadoop configuration:
spark.hadoop.fs.s3a.endpoint s3.ap-northeast-2.amazonaws.com
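If you cannot edit spark-defaults.conf, the same option can also be set at runtime on the SparkContext's Hadoop configuration. A minimal sketch (using the bucket's region from above):

// equivalent to the spark-defaults.conf entry, set from spark-shell
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.endpoint", "s3.ap-northeast-2.amazonaws.com")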
Also, aws-java-sdk versions released before June 2016 use the V2 signing mechanism by default, so the SDK must be explicitly told to use V4. This can be fixed by setting a Java system property:
import com.amazonaws.SDKGlobalConfiguration
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")
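Note that System.setProperty only affects the JVM it runs in, i.e. the driver. On a real cluster the executors create their own S3 clients, so they likely need the property as well. Since ENABLE_S3_SIGV4_SYSTEM_PROPERTY is the constant for com.amazonaws.services.s3.enableV4, one way (a sketch, not verified on every deploy mode) is to pass it as a JVM option in spark-defaults.conf:

# spark-defaults.conf
spark.driver.extraJavaOptions   -Dcom.amazonaws.services.s3.enableV4=true
spark.executor.extraJavaOptions -Dcom.amazonaws.services.s3.enableV4=true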
If either fix is missing, the Bad Request error occurs.
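Putting both fixes together, a spark-shell session against the same bucket looks roughly like this (a sketch in local mode, assuming the instance profile still supplies the credentials; on a cluster, use the extraJavaOptions approach above so the executors also get the V4 flag):

import com.amazonaws.SDKGlobalConfiguration

// 1. aws-java-sdk 1.7.4 defaults to V2 signing; force V4
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true")

// 2. point s3a at the region-specific endpoint (Seoul)
spark.sparkContext.hadoopConfiguration
  .set("fs.s3a.endpoint", "s3.ap-northeast-2.amazonaws.com")

// the read that previously failed with 400 Bad Request
spark.sparkContext.textFile("s3a://ap-northeast-2-bucket/file-1").take(10)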
Upvotes: 1