Reputation: 138
Glue DataBrew appears to be splitting the output of my job into several very small CSV files when it saves them to S3, and some of them are 0 B.
Just to clarify: this is the output produced by a single job run. I've configured the job to write the result to S3 as CSV. Glue DataBrew splits the output across several CSVs for some reason, and that's fine, but why are some of them empty?
Upvotes: 2
Views: 896
Reputation: 370
When writing output, Glue creates one file per Spark partition (each partition is written by its own task under the hood). If a partition was processed but ended up containing no rows, it still produces an empty file.
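You can check how many partitions, and therefore how many output files, a frame will produce. A minimal sketch, assuming frame is a Glue DynamicFrame:

# Convert to a Spark DataFrame to inspect the underlying partitioning;
# the partition count equals the number of files that will be written.
num_partitions = frame.toDF().rdd.getNumPartitions()
print(num_partitions)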
If you want to consolidate the data into a single partition (and therefore a single file) before writing, you can do something like the code below (assuming Python). This assumes that the frame variable is a Glue DynamicFrame.
Note that this has a performance cost, since all of the data must be shuffled onto a single partition before it can be written.
frame = frame.repartition(1)  # ensure the output is written as a single file
Do this just before you write the frame out to S3.
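For context, a minimal sketch of how this fits into a full Glue ETL script (assuming Python; the catalog database, table name, and bucket path are hypothetical placeholders):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical source: replace with your own catalog database and table.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

# Collapse everything into a single partition so only one CSV is written.
frame = frame.repartition(1)

# Hypothetical destination path.
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv",
)

If the full shuffle that repartition(1) triggers is too expensive for your data volume, DynamicFrame also exposes coalesce, which reduces the partition count without a full shuffle, at the cost of less parallelism in the final stage.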
Upvotes: 3