Reputation: 138
Glue DataBrew appears to be splitting the output of my job into several very small CSV files when it saves them to S3, and some of them are 0 B.
Just to clarify: this is the output produced by a single job run. I've configured the job to write the result to S3 as CSV. Glue DataBrew splits the output across several CSVs for some reason, and that's fine, but why are some of them empty?
Upvotes: 2
Views: 896
Reputation: 370
When writing output, Glue creates one file per Spark partition (each partition is written by its own task under the hood). If a partition was processed but ended up containing no rows, it still produces an empty file.
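You can check how many partitions, and therefore how many output files, a frame will produce. A minimal sketch, assuming frame is a Glue DynamicFrame:

# Convert to a Spark DataFrame to inspect the underlying partitioning;
# the partition count equals the number of files that will be written.
num_partitions = frame.toDF().rdd.getNumPartitions()
print(num_partitions)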
If you want to consolidate the data into a single partition (and therefore a single file) before writing, you can do something like the code below (assuming Python). This assumes that the frame variable is a Glue DynamicFrame.
Note that this has a performance cost, since all of the data must be shuffled onto a single partition before it can be written.
frame = frame.repartition(1)  # ensure the output is written as a single file
Do this just before you write the frame out to S3.
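For context, a minimal sketch of how this fits into a full Glue ETL script (assuming Python; the catalog database, table name, and bucket path are hypothetical placeholders):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical source: replace with your own catalog database and table.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

# Collapse everything into a single partition so only one CSV is written.
frame = frame.repartition(1)

# Hypothetical destination path.
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv",
)

If the full shuffle that repartition(1) triggers is too expensive for your data volume, DynamicFrame also exposes coalesce, which reduces the partition count without a full shuffle, at the cost of less parallelism in the final stage.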
Upvotes: 3