raphaelauv

Reputation: 980

Spark AWS S3A ARN (Amazon Resource Name) IAM role

I'm using Spark 2.3.0 and Hadoop 2.7 (but I can upgrade if necessary).

I want to access an S3 file using an ARN (Amazon Resource Name) IAM role: https://docs.aws.amazon.com/cli/latest/userguide/cli-multiple-profiles.html

I already took a look at How to access s3a:// files from Apache Spark?, but there is nothing there about IAM access.

import org.apache.spark.sql.SparkSession;

public class test {

    public static void main(String[] args) {
        SparkSession sc = new SparkSession.Builder()
                .appName("test")
                .config("spark.master", "local[*]") // for example
                .config("spark.hadoop.fs.s3a.access.key", "****")
                .config("spark.hadoop.fs.s3a.secret.key", "****")
                // .config("spark.hadoop.fs.s3a.arn_role", "arn:aws:iam::***:role/******")
                .getOrCreate();

        sc.read().format("csv").load("s3a://toto/****.csv").printSchema();
    }
}

I didn't find any option or configuration for this.

I'm also looking for a solution with arguments on spark-submit, not inside a configuration file (this needs to be dynamic).

Do you have any idea?

Upvotes: 1

Views: 3256

Answers (2)

stevel

Reputation: 13480

Explicit support for IAM assumed roles is a very new feature in the S3A code (HADOOP-15141), and still not completely stable (HADOOP-15583), so you won't gain anything by upgrading.

What could help is the session credential support in Hadoop 2.8 (HADOOP-12537).

Here you'd need to somehow get the temporary credentials for your IAM role (maybe via the AWS CLI? If not, a little bit of the AWS SDK lets you do this). Imagine a mix of this code and this.
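If it helps, here's a minimal sketch of the SDK route using the AWS Java SDK v1 STS client; the role ARN and session name below are placeholders, not anything from your setup:

import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
import com.amazonaws.services.securitytoken.model.Credentials;

public class AssumeRoleExample {

    public static void main(String[] args) {
        // STS client built from the default credential chain
        AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();

        // Ask STS for temporary credentials for the role (placeholder ARN)
        Credentials creds = sts.assumeRole(new AssumeRoleRequest()
                .withRoleArn("arn:aws:iam::123456789012:role/my-role")
                .withRoleSessionName("spark-s3a-session"))
                .getCredentials();

        // The three values the S3A connector will need
        System.out.println(creds.getAccessKeyId());
        System.out.println(creds.getSecretAccessKey());
        System.out.println(creds.getSessionToken());
    }
}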

The assumeRole call gives you a session credential set (access key, secret key, session token), which you then need to set in the Spark context, switching the credential provider to the temporary provider, as covered here.
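Wired into Spark, that could look something like this sketch (assuming the Hadoop 2.8+ S3A binaries are on the classpath; creds is the Credentials object from the assumeRole sketch above):

SparkSession spark = new SparkSession.Builder()
        .appName("test")
        .config("spark.master", "local[*]")
        .config("spark.hadoop.fs.s3a.access.key", creds.getAccessKeyId())
        .config("spark.hadoop.fs.s3a.secret.key", creds.getSecretAccessKey())
        .config("spark.hadoop.fs.s3a.session.token", creds.getSessionToken())
        // switch S3A to the temporary-credentials provider so the token is used
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
        .getOrCreate();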

You should then be able to work through Spark in that IAM role until the session expires (sessions have now been extended to last a few hours; until March 2018 they only lasted a few minutes).

The full IAM role support in Hadoop 3.1+ lets you declare the IAM role and any extra policy, and have the connector automatically log you in and then refresh the session tokens regularly. You won't have that, so your Spark job can't last longer than the lifespan of the credentials you obtained at launch time.
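For comparison, on Hadoop 3.1+ that role support is pure configuration; a sketch, with a placeholder ARN:

// Hadoop 3.1+ only: the connector assumes the role itself and
// refreshes the session tokens as they expire.
SparkSession spark = new SparkSession.Builder()
        .appName("test")
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
        .config("spark.hadoop.fs.s3a.assumed.role.arn",
                "arn:aws:iam::123456789012:role/my-role") // placeholder ARN
        .getOrCreate();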

Upvotes: 1

Dominic Nguyen

Reputation: 813

If you run Spark on EC2 and want to use an IAM role, you don't need to change your code: just create a role in the IAM console and assign it to your EC2 instance. Everything that runs on that instance inherits the role's privileges.

If you run on EMR, create a role and specify the role ARN in the Lambda script that calls the EMR cluster API; access the role ARN via a Lambda environment variable.
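A rough sketch of that Lambda-side call with the AWS Java SDK v1 (the environment variable names, release label, and instance settings here are made up for illustration):

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

public class LaunchEmr {

    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        emr.runJobFlow(new RunJobFlowRequest()
                .withName("spark-cluster")
                .withReleaseLabel("emr-5.16.0") // example release
                .withApplications(new Application().withName("Spark"))
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(1)
                        .withMasterInstanceType("m4.large")
                        .withKeepJobFlowAliveWhenNoSteps(false))
                // role names read from (hypothetical) Lambda environment variables
                .withJobFlowRole(System.getenv("EMR_EC2_ROLE"))     // EC2 instance profile
                .withServiceRole(System.getenv("EMR_SERVICE_ROLE")));
    }
}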

Upvotes: 1
