Dylan

Reputation: 913

Pyspark 2.4.0 hadoopConfiguration to write to S3

Pyspark version 2.4.0

I'm writing files to an S3 bucket I don't own, and afterwards everyone else has trouble reading the files. I think the issue is similar to this question: How to assign the access control list (ACL) when writing a CSV file to AWS in pyspark (2.2.0)?

But that solution no longer seems to work. I searched the PySpark docs but didn't find an answer. I tried:

from pyspark.sql import SparkSession
spark = SparkSession.\
    builder.\
    master("yarn").\
    appName(app_name).\
    enableHiveSupport().\
    getOrCreate()
spark.sparkContext.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")

This is giving me: ERROR - {"exception": "'SparkContext' object has no attribute 'hadoopConfiguration'"}

Upvotes: 0

Views: 2801

Answers (1)

Napoleon Borntoparty

Reputation: 1972

There are two issues at hand.

  1. In order to apply a new config, you need to stop the current context and getOrCreate() your SparkSession again with that config; you can't simply set it on a running session. For example (a quick way to verify the change follows the list):
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext
conf = pyspark.SparkConf().setAll([('spark.executor.memory', '1g')])

# stop the sparkContext and set new conf
sc.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()
  2. In order to set Hadoop configs this way, you need to prefix each key with spark.hadoop. This means your config will become (see the end-to-end sketch after the list):
conf = pyspark.SparkConf().setAll([("spark.hadoop.fs.s3a.acl.default", "BucketOwnerFullControl")])
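As a quick check on the first point, you can read a value back from the recreated session to confirm it took effect (standard SparkConf API, using the example above):

# confirm the recreated session picked up the new value
print(spark.sparkContext.getConf().get("spark.executor.memory"))  # prints '1g'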
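Putting both points together, here is a minimal end-to-end sketch for the ACL case. The DataFrame, bucket, and output path are placeholders (not from the question), and writing to s3a:// assumes the S3A connector (hadoop-aws) is on the classpath:

import pyspark
from pyspark.sql import SparkSession

# Hadoop options set through SparkConf need the spark.hadoop. prefix
conf = pyspark.SparkConf().setAll(
    [("spark.hadoop.fs.s3a.acl.default", "BucketOwnerFullControl")]
)

# local master keeps the sketch self-contained; swap in your yarn/Hive builder settings
spark = SparkSession.builder.master("local").config(conf=conf).getOrCreate()

# hypothetical data and bucket, purely for illustration
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").csv("s3a://some-bucket-you-dont-own/some/prefix/")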

Hope this helps.

Upvotes: 1
