Reputation: 109
I have an ETL job that runs daily, uses bookmarks, and writes the increment to an output S3 bucket. The output is partitioned by a single key.
Now I want to end up with just one file per partition. I can achieve that on the first run of the job as follows:
# collapse the increment into a single Spark partition so that each
# output partition directory ends up with exactly one file
datasource = datasource.repartition(1)

glueContext.write_dynamic_frame.from_options(
    connection_type = "s3",
    frame = datasource,
    connection_options = {"path": output_path, "partitionKeys": ["a_key"]},
    format = "glueparquet",
    format_options = {"compression": "gzip"},
    transformation_ctx = "write_dynamic_frame")
What I can't figure out is how to write my increment so that it ends up compacted together with the files that are already in my output bucket/partition. One option would be to read the table from the previous day and merge it with the increment, but that seems like overkill.
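For reference, a minimal sketch of that read-and-merge option, assuming the same glueContext, datasource, output_path and partition key as in the snippet above (the existing output is read back as plain parquet and unioned with the increment via DataFrames):

from awsglue.dynamicframe import DynamicFrame

# read back what previous runs already wrote to the output location
existing = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": [output_path]},
    format = "parquet")

# union the old data with today's increment and collapse to one file per partition
merged_df = existing.toDF().unionByName(datasource.toDF()).repartition(1)
merged = DynamicFrame.fromDF(merged_df, glueContext, "merged")

# the merged frame would then have to replace (not append to) the existing files,
# e.g. by writing to a staging path and swapping it in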
Any smarter ideas?
Upvotes: 1
Views: 3149
Reputation: 439
I ran into the same issue and discovered that the compression setting goes in connection_options:
connection_options = {"path": file_path, "compression": "gzip", "partitionKeys": ["a_key"]}
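Applied to the write call from the question, that would look something like this (same frame, path and partition key as above; untested sketch):

glueContext.write_dynamic_frame.from_options(
    connection_type = "s3",
    frame = datasource,
    connection_options = {"path": output_path, "compression": "gzip", "partitionKeys": ["a_key"]},
    format = "glueparquet",
    transformation_ctx = "write_dynamic_frame")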
Upvotes: 3