Reputation: 11
I wrote a file compactor using PySpark. It works by reading the entire content of a directory into a Spark DataFrame and then repartitioning it to reduce the number of files. The desired number of files is calculated as: directory_size / wanted_file_size.
The issue I'm facing is that the size of the directory after compaction is bigger than the size of the directory before compaction.
For example: the directory size before compaction is 68 GB with 1431 files. After compaction, the file count is 145 and the directory size is 76 GB. All files in this example are Parquet and compressed with Snappy, both before and after the compaction.
Can someone please help me understand why the directory size changes, and how can I fix it?
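For reference, here is a minimal sketch of the flow I described above; the paths, the measured directory size, and the target file size are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compactor").getOrCreate()

input_dir = "s3://bucket/table/"            # hypothetical input path
output_dir = "s3://bucket/table_compacted/" # hypothetical output path
directory_bytes = 68 * 1024**3              # measured directory size (68 GB)
wanted_file_bytes = 512 * 1024 * 1024       # desired size per output file

# number of output files = directory_size / wanted_file_size
num_files = max(1, directory_bytes // wanted_file_bytes)

df = spark.read.parquet(input_dir)

# repartition(n) shuffles the data into n partitions; each partition is
# then written out as one snappy-compressed Parquet file
(df.repartition(num_files)
   .write.mode("overwrite")
   .option("compression", "snappy")
   .parquet(output_dir))
```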
Upvotes: 1
Views: 342
Reputation: 1
This appears to be an issue with your configuration: it's not distributed enough. Try increasing the number of workers and that should resolve your issue. This is a classic symptom of data skew.
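If you want to try that, a minimal sketch of raising the worker count through Spark configuration might look like this (the values are placeholders to tune for your cluster, not recommendations):

```python
from pyspark.sql import SparkSession

# Hypothetical settings; adjust to your cluster's capacity. These raise
# parallelism, but they do not directly control on-disk output file size.
spark = (SparkSession.builder
         .appName("compactor")
         .config("spark.executor.instances", "20")      # number of workers
         .config("spark.executor.cores", "4")           # cores per worker
         .config("spark.sql.shuffle.partitions", "200") # shuffle parallelism
         .getOrCreate())
```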
Upvotes: 0