artsince

Reputation: 1032

spark saveAsTextFile to s3 fails

I have a Spark process that takes two input files from S3. At the end of the job, I simply want to write the results back to S3 with the saveAsTextFile method. However, I am getting Access Denied errors.

My policy rule is wide open to make sure I don't have any permission errors:

{
    "Version": "2012-10-17",
    "Id": "Policy1457106962648",
    "Statement": [
        {
            "Sid": "Stmt1457106959104",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::<bucket-name>/*"
        }
    ]
}

I set my credentials on SparkContext like the following:

SparkConf conf = new SparkConf()
                .setAppName("GraphAnalyser")
                .setMaster("local[*]")
                .set("spark.driver.memory", "2G")
                .set("spark.hadoop.fs.s3.awsAccessKeyId", [access-key])
                .set("spark.hadoop.fs.s3n.awsAccessKeyId", [access-key])
                .set("spark.hadoop.fs.s3.awsSecretAccessKey", [secret-key])
                .set("spark.hadoop.fs.s3n.awsSecretAccessKey", [secret-key]);

And I pass file URLs using the s3n protocol:

final String SC_NODES_FILE  = "s3n://" + BUCKET_NAME + "/" + NODES_FILE;
final String SC_EDGES_FILE  = "s3n://" + BUCKET_NAME + "/" + EDGES_FILE;
final String SC_OUTPUT_FILE = "s3n://" + BUCKET_NAME + "/output";
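For context, the read/save flow these constants feed into looks roughly like this (a sketch; the analysis step in the middle is a hypothetical placeholder, only the I/O calls matter here):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(conf);

// Reading the inputs works without any permission problems.
JavaRDD<String> nodes = sc.textFile(SC_NODES_FILE);
JavaRDD<String> edges = sc.textFile(SC_EDGES_FILE);

// Placeholder for the actual graph analysis producing the results.
JavaRDD<String> results = edges;

// This is the call that fails: before writing, Hadoop's S3 output layer
// checks whether the output path already exists, and that existence
// check is what comes back as 403 Forbidden.
results.saveAsTextFile(SC_OUTPUT_FILE);
```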

Note that I have no trouble accessing the input files. It seems that Spark sends a HEAD request for the output path to make sure it does not exist before attempting to save the final results. Since S3 returns Access Denied instead of Not Found, that is probably why Spark throws an exception and exits.

org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/output.csv' - ResponseCode=403, ResponseMessage=Forbidden

Versions: Spark 1.6.0, aws-java-sdk 1.10.58, spark-core_2.10 1.6.0

Your help is appreciated. Thank you very much.

Upvotes: 1

Views: 1417

Answers (1)

artsince

Reputation: 1032

answering my own question

It turns out that I needed the s3:ListBucket action, which only applies when the resource is the bucket itself, not the keys inside the bucket.

In my original policy file I had the following resource:

"Resource": "arn:aws:s3:::<bucket-name>/*"

I had to add:

"Resource": "arn:aws:s3:::<bucket-name>"
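Viewed in isolation, the missing piece is an Allow statement whose resource is the bucket ARN itself; a minimal sketch (Sid and Principal omitted for brevity):

```json
{
  "Effect": "Allow",
  "Action": "s3:ListBucket",
  "Resource": "arn:aws:s3:::<bucket-name>"
}
```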

Here's my final policy file that works for me:

{
  "Id": "Policy145712123124123",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt145712812312323",
      "Action": [
        "s3:DeleteObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam::<account-id>:user/<user-name>"
        ]
      }
    }
  ]
}

Upvotes: 3
