Reputation: 71
I am facing the following error while writing to an S3 bucket using PySpark:
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: A0B0C0000000DEF0, AWS Error Code: InvalidArgument, AWS Error Message: Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.,
I have enabled server-side encryption with an AWS KMS managed key on the S3 bucket. I am using the following spark-submit command:
spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 --jars sample-jar sample_pyspark.py
This is the sample code I am working with:
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

spark_context = SparkContext()
sql_context = SQLContext(spark_context)
spark = SparkSession.builder.appName('abc').getOrCreate()
hadoopConf = spark_context._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# 'source_data' is an existing Spark DataFrame
source_data.coalesce(1).write.mode('overwrite').parquet("s3a://sample-bucket")
Note: I tried to load the Spark DataFrame into an S3 bucket without server-side encryption enabled, and it was successful.
Upvotes: 5
Views: 9412
Reputation: 426
The error seems to be telling you to enable V4 S3 signatures on the Amazon SDK. One way to do it is from the command line:
spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
--conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' \
... (other spark options)
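If you'd rather not pass these flags on every launch, the same JVM option can be supplied through a SparkConf before the session starts. This is a minimal sketch, not the only way to do it: the driver option set this way only takes effect if the driver JVM hasn't started yet (in client mode, put it in spark-defaults.conf or on spark-submit instead), and the region endpoint below is a placeholder for your bucket's actual region:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    # Tell the AWS SDK to sign S3 requests with Signature Version 4.
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4")
)
spark = SparkSession.builder.config(conf=conf).appName("abc").getOrCreate()

# V4 signing also needs the region-specific S3 endpoint;
# "us-east-2" here is a placeholder for your bucket's region.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3a.endpoint", "s3.us-east-2.amazonaws.com"
)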
That said, I agree with Steve that you should use a more recent Hadoop library.
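For reference, once you're on a newer hadoop-aws, the S3A connector can request SSE-KMS itself instead of relying only on the bucket default. A minimal sketch, assuming the Hadoop 2.9+/3.x property names (these have been renamed across releases, so check the S3A documentation for your exact version); the key ARN is a placeholder:

# Make S3A send SSE-KMS headers on every write (Hadoop 2.9+/3.x).
hadoopConf = spark.sparkContext._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
# Placeholder ARN; substitute your own KMS key. If the key is omitted,
# S3 falls back to the account's default aws/s3 key.
hadoopConf.set(
    "fs.s3a.server-side-encryption.key",
    "arn:aws:kms:us-east-2:111122223333:key/EXAMPLE-KEY-ID",
)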
Upvotes: 2