raphaelauv

Reputation: 980

Spark AWS S3A ARN (Amazon Resource Name) IAM role

I'm using Spark 2.3.0 and Hadoop 2.7 (but I can upgrade if necessary).

I want to access an S3 file using an ARN (Amazon Resource Name) IAM role: https://docs.aws.amazon.com/cli/latest/userguide/cli-multiple-profiles.html

I already took a look at How to access s3a:// files from Apache Spark?, but there is nothing there about IAM access.

import org.apache.spark.sql.SparkSession;

public class test {

    public static void main(String[] args) {
        SparkSession sc = new SparkSession.Builder()
                .appName("test")
                .config("spark.master", "local[*]") // for example
                .config("spark.hadoop.fs.s3a.access.key", "****")
                .config("spark.hadoop.fs.s3a.secret.key", "****")
                // .config("spark.hadoop.fs.s3a.arn_role", "arn:aws:iam::***:role/******")
                .getOrCreate();

        sc.read().format("csv").load("s3a://toto/****.csv").printSchema();
    }
}

I didn't find any option or configuration for this.

I'm also looking for a solution with arguments on spark-submit, not inside a configuration file (this needs to be dynamic).

Do you have any idea?

Upvotes: 1

Views: 3256

Answers (2)

stevel

Reputation: 13480

Explicit support for IAM assumed roles is a very new feature in the S3A code (HADOOP-15141), and still not completely stable (HADOOP-15583), so you won't gain anything by upgrading.

What could help is the session credential support in Hadoop 2.8 (HADOOP-12537).

Here you'd need to somehow get the temporary credentials for your IAM role (maybe via the AWS CLI? If not, a little bit of the AWS SDK lets you do this). Imagine a mix of this code and this.
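If it helps, here's a minimal sketch of the SDK route using the AWS Java SDK v1 STS client; the role ARN and session name below are placeholders, not anything from your setup:

import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
import com.amazonaws.services.securitytoken.model.Credentials;

public class AssumeRoleExample {

    public static void main(String[] args) {
        // STS client built from the default credential chain
        AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();

        // Ask STS for temporary credentials for the role (placeholder ARN)
        Credentials creds = sts.assumeRole(new AssumeRoleRequest()
                .withRoleArn("arn:aws:iam::123456789012:role/my-role")
                .withRoleSessionName("spark-s3a-session"))
                .getCredentials();

        // The three values the S3A connector will need
        System.out.println(creds.getAccessKeyId());
        System.out.println(creds.getSecretAccessKey());
        System.out.println(creds.getSessionToken());
    }
}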

The assumeRole call gives you a session credential set (access key, secret key, session token), which you then need to set in the Spark context, switching the credential provider to the temporary provider, as covered here.
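Wired into Spark, that could look something like this sketch (assuming the Hadoop 2.8+ S3A binaries are on the classpath; creds is the Credentials object from the assumeRole sketch above):

SparkSession spark = new SparkSession.Builder()
        .appName("test")
        .config("spark.master", "local[*]")
        .config("spark.hadoop.fs.s3a.access.key", creds.getAccessKeyId())
        .config("spark.hadoop.fs.s3a.secret.key", creds.getSecretAccessKey())
        .config("spark.hadoop.fs.s3a.session.token", creds.getSessionToken())
        // switch S3A to the temporary-credentials provider so the token is used
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
        .getOrCreate();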

You should then be able to work through Spark in that IAM role until the session expires (sessions have now been extended to last a few hours; until March 2018 they only lasted a few minutes).

The full IAM role support in Hadoop 3.1+ lets you declare the IAM role and any extra policy, and have the connector automatically log you in and then refresh the session tokens regularly. You won't have that, so your Spark job can't last longer than the lifespan of the credentials you obtained at launch time.
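For comparison, on Hadoop 3.1+ that role support is pure configuration; a sketch, with a placeholder ARN:

// Hadoop 3.1+ only: the connector assumes the role itself and
// refreshes the session tokens as they expire.
SparkSession spark = new SparkSession.Builder()
        .appName("test")
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
        .config("spark.hadoop.fs.s3a.assumed.role.arn",
                "arn:aws:iam::123456789012:role/my-role") // placeholder ARN
        .getOrCreate();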

Upvotes: 1

Dominic Nguyen

Reputation: 813

If you run Spark on EC2 and want to use an IAM role, you don't need to change your code: just create a role in the IAM console and assign it to your EC2 instance. Everything that runs on that instance inherits the role's privileges.

If you run on EMR, create a role and specify the role ARN in the Lambda script that calls the EMR cluster API; access the role ARN via a Lambda environment variable.
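A rough sketch of that Lambda-side call with the AWS Java SDK v1 (the environment variable names, release label, and instance settings here are made up for illustration):

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

public class LaunchEmr {

    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        emr.runJobFlow(new RunJobFlowRequest()
                .withName("spark-cluster")
                .withReleaseLabel("emr-5.16.0") // example release
                .withApplications(new Application().withName("Spark"))
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(1)
                        .withMasterInstanceType("m4.large")
                        .withKeepJobFlowAliveWhenNoSteps(false))
                // role names read from (hypothetical) Lambda environment variables
                .withJobFlowRole(System.getenv("EMR_EC2_ROLE"))     // EC2 instance profile
                .withServiceRole(System.getenv("EMR_SERVICE_ROLE")));
    }
}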

Upvotes: 1
