filip stepniak

Reputation: 109

AWS Glue write and compress with the files in output bucket

I have an ETL job that runs daily, uses bookmarks, and writes the increment to an output S3 bucket. The output bucket is partitioned by one key.

Now, I want to end up with just one file per partition. I can achieve that on the first run of the job as follows:

# Collapse the increment into a single Spark partition so each output partition gets one file
datasource = datasource.repartition(1)

glueContext.write_dynamic_frame.from_options(
    connection_type = "s3",
    frame = datasource,
    connection_options = {"path": output_path, "partitionKeys": ["a_key"]},
    format = "glueparquet",
    format_options = {"compression": "gzip"},
    transformation_ctx = "write_dynamic_frame")

What I can't figure out is how to write and compress my increment together with the files that are already in my output bucket/partition. One option would be to read the previous day's table, merge it with the increment, and rewrite the partition (roughly the sketch below), but that seems like overkill.
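For reference, the merge option I mean would look roughly like this. It is a rough, untested sketch using plain Spark instead of the Glue writer; it assumes glueContext, datasource, output_path and "a_key" from the snippet above, and that dynamic partition overwrite is available in your Glue/Spark version:

spark = glueContext.spark_session

# Read yesterday's output and union it with today's increment
existing_df = spark.read.parquet(output_path)
merged_df = existing_df.unionByName(datasource.toDF()).cache()
merged_df.count()  # materialize before overwriting the path we just read from

# Rewrite only the touched partitions, one gzip-compressed file per partition value
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(merged_df
    .repartition("a_key")
    .write
    .mode("overwrite")
    .partitionBy("a_key")
    .option("compression", "gzip")
    .parquet(output_path))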

Any smarter ideas?

Upvotes: 1

Views: 3149

Answers (1)

Starlton

Reputation: 439

I was running into the same issue and discovered that the compression setting goes in the connection_options:

connection_options = {"path": file_path, "compression": "gzip", "partitionKeys": ["a_key"]}
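In context, the full write call would look something like this (adapted from the question's snippet, with file_path standing in for the output path):

glueContext.write_dynamic_frame.from_options(
    connection_type = "s3",
    frame = datasource,
    connection_options = {"path": file_path, "compression": "gzip", "partitionKeys": ["a_key"]},
    format = "glueparquet",
    transformation_ctx = "write_dynamic_frame")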

Upvotes: 3
