ronald mcdolittle

Reputation: 567

Datasink write fails when writing parquet but not CSV

I have the following code:

val datasink3 = glueContext
  .getSinkWithFormat(
     connectionType = "s3", 
     options = JsonOptions(Map("path" -> outputPath)),
     format = "parquet", 
     transformationContext = "datasink3")
  .writeDynamicFrame(repartitionedDataSource3)

This write fails with

Exception in User Class: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception : Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 9K7H4CDMRM3AM51H; S3 Extended Request ID: DgRwQ8tvq2FjlmVJ4GkBjYW5xwN8lMYtoStvpe8zRr+bSx0pwcybYDSuZYXXJN0pF1pWHiziuAI=)

However, if I switch the write to

val datasink3 = glueContext
  .getSinkWithFormat(
     connectionType = "s3", 
     options = JsonOptions(Map("path" -> outputPath)),
     format = "csv", 
     transformationContext = "datasink3")
  .writeDynamicFrame(repartitionedDataSource3)

It works! What the hell!

The IAM policy has the following permissions; none of the resource-level permissions restrict by file type:

"Statement": [
  {
    "Sid": "VisualEditor0",
    "Effect": "Allow",
    "Action": [
      "s3:PutObject",
      "s3:GetObject",
      "s3:ListBucket",
      "s3:DeleteObject"
    ]

Any ideas? This is weird as hell

Upvotes: 1

Views: 334

Answers (1)

ronald mcdolittle

Reputation: 567

Here's the issue. I had permissioned the role to have access only to certain folders (prefixes), i.e.

bucket/toplevelfolder/subfolder*
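
In policy terms, the Resource section of that statement looked roughly like the sketch below (the bucket, folder, and subfolder names are placeholders standing in for the real ones):

"Resource": [
  "arn:aws:s3:::bucket",
  "arn:aws:s3:::bucket/toplevelfolder/subfolder*"
]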

Glue uses Spark as its ETL engine under the hood. Before writing to the actual destination, the Glue job attempted to create a Spark/EMRFS placeholder object named "toplevelfolder%24folder%24" (the %24 is a URL-encoded "$") directly under "s3://bucket/", a path the role did not have access to.

By adding S3 permissions on "s3://bucket/*", the role was able to write the necessary placeholder objects before writing into the prefix where the Spark (Glue) job outputs the data.
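
For reference, here is a minimal sketch of the extra statement I added (the Sid and the bucket name are placeholders, and you could scope this more tightly than "s3://bucket/*" if you prefer):

{
  "Sid": "AllowBucketRootPlaceholders",
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:ListBucket",
    "s3:DeleteObject"
  ],
  "Resource": [
    "arn:aws:s3:::bucket",
    "arn:aws:s3:::bucket/*"
  ]
}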

This only occurs with Parquet because, when writing Parquet, the EMRFS implementation by default creates temporary folder placeholder objects on the s3/s3n paths. This behavior is described in

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/

Upvotes: 2
