Reputation: 21
The current Spark environment I work with is Spark 2.4 on Hadoop 2.7, and Hadoop 2.7 doesn't support SSE-KMS: per Apache's HADOOP-13075, it was introduced in 2.8 and fully supported from Hadoop 3.0. The official docs then say the two configuration parameters fs.s3a.server-side-encryption-algorithm
and fs.s3a.server-side-encryption.key
should be added.
Based on those docs, I added the packages org.apache.hadoop:hadoop-aws:3.1.1
and com.amazonaws:aws-java-sdk:1.9.5
to the spark-submit
parameters, and added
spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", aws_sse_algorithm)
spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", aws_sse_key)
to the Spark config, where aws_sse_algorithm
is SSE-KMS
and aws_sse_key
is the KMS key ARN provided by our admin.
In the meantime I basically added every parameter I could to the config. However, I got this exception:
Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.
when I retrieve the s3 object in spark:
df = spark.read.json('s3a://XXXXXXX/XXXXX/XXXXXXXX/result.json')
2019-08-09 14:54:09,525 ERROR executor.Executor: Exception in task 0.0 in stage 4.0 (TID 4)
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 7C1C371AE02F476A, AWS Error Code: InvalidArgument,
AWS Error Message: Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4., S3 Extended Request ID: hlCH96//G18Bs47fGJwxt+Ccpdf0YNOadt9bUPYei2InkkUeKCslq/4m353RnQEhopBfvjVIcx0=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
.......
My full code:
import datetime, time
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, IntegerType, DoubleType, ArrayType, StructType, StructField, MapType
import boto3
import json
import pytz
import configparser
import argparse
from dateutil.parser import parse
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:3.1.1,org.apache.hadoop:hadoop-common:3.1.1,org.apache.hadoop:hadoop-auth:3.1.1," \
    "com.amazonaws:aws-java-sdk:1.9.5 " \
    "pyspark-shell"
spark = SparkSession.builder.appName("test").getOrCreate()
aws_sse_algorithm = 'SSE-KMS'
aws_sse_key = 'arn:aws:kms:ap-southeast-1:XXXXXXX:key/XXXXXX'
aws_access_id = 'XXXXX'
aws_access_key = 'XXXXX'
aws_region = 'ap-southeast-1'
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_id) spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_access_key) spark._jsc.hadoopConfiguration().set("fs.s3a.fast.upload", "true") spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true") spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider") spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3."+aws_region+".amazonaws.com")
spark._jsc.hadoopConfiguration().set("fs.s3a.sse.enabled", "true") spark._jsc.hadoopConfiguration().set("fs.s3a.enableServerSideEncryption", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", aws_sse_algorithm) spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", aws_sse_key) spark._jsc.hadoopConfiguration().set("fs.s3a.sse.kms.keyId", aws_sse_key)
df = spark.read.json('s3a://XXXXXXX/XXXXX/XXXXXXXX/result.json')
I am not sure whether this is related to the Hadoop jars on the local Spark classpath still being version 2.7.3, even though I added the 3.1.1 jars via the --packages part for Spark.
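One way to check which Hadoop version the driver JVM actually loaded (a diagnostic sketch; org.apache.hadoop.util.VersionInfo ships in hadoop-common, so this reflects what is really on the classpath, not what --packages requested):

# Ask the driver JVM which hadoop-common it actually loaded.
# If this prints 2.7.3, the old s3a implementation is still in use.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())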
Upvotes: 1
Views: 2578
Reputation: 13430
If you are having to set jvm options for v4 signing to work, then you are still using the hadoop-2.7 s3a implementation.
Until you have that consistent set of JARs you are, sadly, doomed. You'll only end up frittering away time moving the stack traces around. Get those dependencies right first.
Which means: move up to the Hadoop 2.8+ artifacts. Entirely.
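As a sketch of what a consistent set could look like (assuming your Spark build is not also shipping Hadoop 2.7 jars on its classpath; aws-java-sdk-bundle:1.11.271 is the SDK version the hadoop-aws 3.1.x pom declares, rather than the much older aws-java-sdk:1.9.5):

import os

# Pair hadoop-aws with the SDK bundle it was built against,
# instead of mixing artifact versions across the stack.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages org.apache.hadoop:hadoop-aws:3.1.1,"
    "com.amazonaws:aws-java-sdk-bundle:1.11.271 "
    "pyspark-shell"
)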
Upvotes: 1
Reputation: 21
I figured it out; the correct configuration for Amazon S3 Signature V4 is:
spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
not
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
Upvotes: 1