robert

Reputation: 819

EMR Spark Fails to Save Dataframe to S3

I am using the RunJobFlow command to spin up a Spark EMR cluster. This command sets the JobFlowRole to an IAM role that has the AmazonElasticMapReduceforEC2Role and AmazonRedshiftReadOnlyAccess managed policies attached. The first policy includes an action allowing all S3 permissions.
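
For context, the cluster is launched with something along these lines (AWS Java SDK v1 from Scala; the names, release label, and instance settings here are placeholders rather than my exact configuration):

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
    import com.amazonaws.services.elasticmapreduce.model.{Application, JobFlowInstancesConfig, RunJobFlowRequest}

    val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

    val request = new RunJobFlowRequest()
      .withName("spark-redshift-job")                 // placeholder name
      .withReleaseLabel("emr-5.x.x")                  // placeholder release label
      .withApplications(new Application().withName("Spark"))
      .withServiceRole("EMR_DefaultRole")             // placeholder service role
      // JobFlowRole = the EC2 instance profile with AmazonElasticMapReduceforEC2Role
      // and AmazonRedshiftReadOnlyAccess attached
      .withJobFlowRole("MyEmrEc2Role")
      .withInstances(new JobFlowInstancesConfig()
        .withMasterInstanceType("m4.large")
        .withSlaveInstanceType("m4.large")
        .withInstanceCount(3)
        .withKeepJobFlowAliveWhenNoSteps(true))

    emr.runJobFlow(request)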

When the EC2 instances spin up, they assume this IAM role, and generate temporary credentials via STS.

The first thing I do is read a table from my Redshift cluster into a Spark DataFrame using the com.databricks.spark.redshift format, passing the same IAM role to unload the data from Redshift as I used for the EMR JobFlowRole.

As far as I understand, this runs an UNLOAD command on Redshift to dump the table into the S3 bucket I specify. Spark then loads the newly unloaded data into a DataFrame. I use the recommended s3n:// protocol for the tempdir option.
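
For reference, the read looks roughly like this (the JDBC URL, table name, and role ARN are placeholders for my actual values):

    val df = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://my-cluster:5439/mydb?user=...&password=...") // placeholder JDBC URL
      .option("dbtable", "my_table")                                               // placeholder table
      .option("tempdir", "s3n://my-bucket/temp/")                                  // same bucket I write to later
      .option("aws_iam_role", "arn:aws:iam::123456789012:role/MyEmrEc2Role")       // same role as the JobFlowRole
      .load()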

This command works great, and it always successfully loads the data into the DataFrame.

I then run some transformations and attempt to save the DataFrame in CSV format to the same S3 bucket Redshift unloaded into.
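
The write itself is roughly this (the output path is a placeholder, but it points at the same bucket):

    df.write
      .format("csv")
      .save("s3n://my-bucket/output/")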

However, when I try to do this, it throws the following error:

java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)

Okay. So I don't know why this happens, but I tried to hack around it by setting the recommended Hadoop configuration parameters. I used DefaultAWSCredentialsProviderChain to load the AWS access key ID and secret access key and set them via:

    spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", <CREDENTIALS_ACCESS_KEY>)
    spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", <CREDENTIALS_SECRET_ACCESS_KEY>)
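
with the placeholder values resolved from the default provider chain, roughly like so:

    import com.amazonaws.auth.DefaultAWSCredentialsProviderChain

    // Resolves credentials from the usual chain (env vars, system properties, instance profile, ...)
    val creds = new DefaultAWSCredentialsProviderChain().getCredentials
    spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", creds.getAWSAccessKeyId)
    spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", creds.getAWSSecretKey)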

When I run it again it throws the following error:

java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;

Okay. So that didn't work either. I then removed the Hadoop configuration settings and instead hardcoded an IAM user's credentials in the S3 URL via s3n://ACCESS_KEY:SECRET_KEY@BUCKET/KEY

When I ran this it spit out the following error:

java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long

So it tried to create a bucket, which is definitely not what we want it to do.

I am really stuck on this one and would really appreciate any help here! It works fine when I run it locally, but completely fails on EMR.

Upvotes: 3

Views: 1444

Answers (2)

robert

Reputation: 819

The problem was the following:

  • The EC2 instances generated temporary credentials during the EMR bootstrap phase.
  • When I queried Redshift, I passed the aws_iam_role to the Databricks driver. The driver then re-generated temporary credentials for that same IAM role, which invalidated the credentials the EC2 instance had generated.
  • I then tried to upload to S3 using the old credentials (the ones still stored in the instance's metadata).

It failed because it was trying to use out-of-date credentials.

The solution was to remove Redshift authorization via aws_iam_role and replace it with the following:

    val credentials = EC2MetadataUtils.getIAMSecurityCredentials
    ...
    .option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
    .option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
    .option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)
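
Pieced together, the Redshift read ends up looking roughly like this (the role name and connection details are placeholders for my actual values):

    import com.amazonaws.util.EC2MetadataUtils

    val IAM_ROLE = "MyEmrEc2Role"   // placeholder: name of the instance-profile role
    val credentials = EC2MetadataUtils.getIAMSecurityCredentials

    val df = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://my-cluster:5439/mydb?user=...&password=...") // placeholder
      .option("dbtable", "my_table")                                               // placeholder
      .option("tempdir", "s3n://my-bucket/temp/")
      .option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
      .option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
      .option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)
      .load()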

Upvotes: 1

stevel

Reputation: 13430

On Amazon EMR, try using the prefix s3:// to refer to an object in S3.
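
For example (bucket and prefix just illustrative):

    df.write.format("csv").save("s3://my-bucket/output/")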

It's a long story.

Upvotes: 0
