Vikash Pareek

Reputation: 1181

How to write a spark rdd to S3 using server side encryption

I am trying to write an RDD to S3 with server-side encryption. Below is my code.

val sparkConf = new SparkConf().
  setMaster("local[*]").
  setAppName("aws-encryption")
val sc = new SparkContext(sparkConf)
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
sc.hadoopConfiguration.setBoolean("fs.s3n.sse.enabled", true)
sc.hadoopConfiguration.set("fs.s3n.enableServerSideEncryption", "true")
sc.hadoopConfiguration.setBoolean("fs.s3n.enableServerSideEncryption", true)
sc.hadoopConfiguration.set("fs.s3n.sse", "SSE-KMS")
sc.hadoopConfiguration.set("fs.s3n.serverSideEncryptionAlgorithm", "SSE-KMS")
sc.hadoopConfiguration.set("fs.s3n.server-side-encryption-algorithm", "SSE-KMS")
sc.hadoopConfiguration.set("fs.s3n.sse.kms.keyId", KMS_ID)
sc.hadoopConfiguration.set("fs.s3n.serverSideEncryptionKey", KMS_ID)

val rdd = sc.parallelize(Seq("one", "two", "three", "four"))
rdd.saveAsTextFile(s"s3n://$bucket/$objKey")

This code writes the RDD to S3, but without encryption (I checked the properties of the written object, and "server-side encrypted" shows as "no"). Am I missing anything here, or using any property incorrectly?

Any suggestions would be appreciated.

P.S. I have set the same properties under different names because I am not sure which name to use when, e.g.

sc.hadoopConfiguration.setBoolean("fs.s3n.sse.enabled", true)
sc.hadoopConfiguration.set("fs.s3n.enableServerSideEncryption", "true")
sc.hadoopConfiguration.setBoolean("fs.s3n.enableServerSideEncryption", true)

Thank you.

Upvotes: 3

Views: 8637

Answers (1)

stevel

Reputation: 13490

  1. Stop using s3n and switch to s3a. I don't remember what s3n does with encryption, but you should switch for the performance and scale benefits alone.
  2. Start with SSE-S3 rather than SSE-KMS, as it's easier to set up.
  3. Turn on encryption in the client via the relevant s3a properties (see below).
  4. Add a bucket policy that mandates encryption. That ensures all clients are always set up correctly (a policy sketch follows the configuration example).

Example client configuration (e.g. in core-site.xml), enabling SSE-S3:

<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>
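
For step 4, a sketch of a bucket policy that rejects unencrypted uploads (assuming SSE-S3, so the expected encryption header value is AES256; use "aws:kms" for SSE-KMS; the bucket name is a placeholder):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedPuts",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}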

See Working with Encrypted Amazon S3 Data; as of October 2019, these are the best docs on encrypting S3 data with s3a, Hadoop, Spark, and Hive.
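
Applied to the question's code, a minimal sketch of the s3a equivalent (assuming Hadoop 2.9+/3.x property names and that the hadoop-aws/s3a connector is on the classpath; AWS_ACCESS_KEY, AWS_SECRET_KEY, and KMS_ID are the question's placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf().
  setMaster("local[*]").
  setAppName("aws-encryption")
val sc = new SparkContext(sparkConf)

// s3a credentials (in production, prefer instance profiles or credential providers)
sc.hadoopConfiguration.set("fs.s3a.access.key", AWS_ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", AWS_SECRET_KEY)

// SSE-S3: a single property, no key management required
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")

// SSE-KMS variant, instead of the line above:
// sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
// sc.hadoopConfiguration.set("fs.s3a.server-side-encryption.key", KMS_ID)

val rdd = sc.parallelize(Seq("one", "two", "three", "four"))
rdd.saveAsTextFile(s"s3a://$bucket/$objKey")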

AWS EMR readers: None of this applies to you. Switch to Apache Hadoop or look up the EMR docs.

Upvotes: 2
