mzdrgwj

Reputation: 21

Failing to retrieve data from an SSE-KMS encrypted S3 object in Spark

The current Spark environment I work with is Spark 2.4 on Hadoop 2.7, but Hadoop 2.7 doesn't support SSE-KMS. According to Apache (HADOOP-13075), it was introduced in 2.8 and fully supported from Hadoop 3.0. The official docs say that two configuration parameters, fs.s3a.server-side-encryption-algorithm and fs.s3a.server-side-encryption.key, should be added.

Based on those docs, I added the packages org.apache.hadoop:hadoop-aws:3.1.1 and com.amazonaws:aws-java-sdk:1.9.5 to the spark-submit parameters, and added

spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", aws_sse_algorithm)`
spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", aws_sse_key)

to the Spark config. aws_sse_algorithm is SSE-KMS, and aws_sse_key is the key provided by our admin.

In the meantime I added basically every parameter I could to the config. However, I got this exception:

Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4.

when I retrieved the S3 object in Spark:

df = spark.read.json('s3a://XXXXXXX/XXXXX/XXXXXXXX/result.json') 
2019-08-09 14:54:09,525 ERROR executor.Executor: Exception in task 0.0 in stage 4.0 (TID 4)
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 7C1C371AE02F476A, AWS Error Code: InvalidArgument, 
AWS Error Message: Requests specifying Server Side Encryption with AWS KMS managed keys require AWS Signature Version 4., S3 Extended Request ID: hlCH96//G18Bs47fGJwxt+Ccpdf0YNOadt9bUPYei2InkkUeKCslq/4m353RnQEhopBfvjVIcx0=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
.......

My full code:

import datetime, time 
from pyspark.sql import SparkSession 
from pyspark.sql import functions as func 
from pyspark.sql.functions import udf 
from pyspark.sql.types import StringType, IntegerType, DoubleType, ArrayType, StructType, StructField, MapType 
import boto3 
import json 
import pytz 
import configparser 
import argparse 
from dateutil.parser import parse

import os

os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:3.1.1,org.apache.hadoop:hadoop-common:3.1.1,org.apache.hadoop:hadoop-auth:3.1.1," \
    "com.amazonaws:aws-java-sdk:1.9.5 " \
    "pyspark-shell"

spark = SparkSession.builder.appName("test").getOrCreate()
aws_sse_algorithm = 'SSE-KMS'
aws_sse_key = 'arn:aws:kms:ap-southeast-1:XXXXXXX:key/XXXXXX'

aws_access_id = 'XXXXX' 
aws_access_key = 'XXXXX' 
aws_region = 'ap-southeast-1'

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_id) spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_access_key) spark._jsc.hadoopConfiguration().set("fs.s3a.fast.upload", "true") spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true") spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider") spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3."+aws_region+".amazonaws.com")

spark._jsc.hadoopConfiguration().set("fs.s3a.sse.enabled", "true") spark._jsc.hadoopConfiguration().set("fs.s3a.enableServerSideEncryption", "true")

spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", aws_sse_algorithm) spark._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", aws_sse_key) spark._jsc.hadoopConfiguration().set("fs.s3a.sse.kms.keyId", aws_sse_key)

df = spark.read.json('s3a://XXXXXXX/XXXXX/XXXXXXXX/result.json')

I am not sure whether this is related to the Hadoop JARs on the local Spark classpath still being version 2.7.3. However, I did add the 3.1.1 JARs to the --packages part for Spark.
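
One way to check which Hadoop version Spark actually loaded is to query hadoop-common's VersionInfo through the JVM gateway (a quick sanity-check sketch; if this prints 2.7.3, the 3.1.1 JARs from --packages are not the ones in effect):

# Print the Hadoop version on Spark's effective classpath.
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())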

Upvotes: 1

Views: 2578

Answers (2)

stevel

Reputation: 13430

If you are having to set JVM options for V4 signing to work, then you are still using the hadoop-2.7 s3a implementation.

  • All hadoop-* JARs must be exactly the same version, or you will see stack traces.
  • The aws-sdk version must be exactly the one that hadoop-aws was built and tested against, or you will see different stack traces.

Until you have that consistent set of JARs you are, sadly, doomed: you'll only end up frittering away time moving the stack traces around. Get those dependencies right first.

Which means: move up to the Hadoop 2.8+ artifacts. Entirely.
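
As an illustration of that advice (a sketch, assuming a Spark distribution whose own hadoop-* JARs are already 3.1.1, e.g. a "without Hadoop" Spark build pointed at a Hadoop 3.1.1 install): request only hadoop-aws and let the resolver pull in the aws-java-sdk-bundle version it declares, rather than pinning an unrelated SDK such as 1.9.5.

import os

# Sketch: hadoop-aws's POM names the exact aws-java-sdk-bundle it was
# built and tested against, so --packages resolves a matching SDK
# transitively. Assumes the hadoop-* JARs already on Spark's classpath
# are 3.1.1 as well.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages=org.apache.hadoop:hadoop-aws:3.1.1 "
    "pyspark-shell"
)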

Upvotes: 1

mzdrgwj

Reputation: 21

I figured it out. The correct configuration for Amazon S3 Signature V4 is:

spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

not

spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")

Upvotes: 1
