Reputation: 11
I wrote a file compactor using PySpark. It works by reading the entire content of a directory into a Spark DataFrame and then repartitioning it to reduce the number of files. The desired number of files is calculated as: directory_size / wanted_file_size.
The issue I'm facing is that the size of the directory after compaction is bigger than the size of the directory before compaction.
For example: the directory size before compaction is 68 GB with 1431 files. After compaction, the file count is 145 and the directory size is 76 GB. All files in this example are Parquet and compressed with Snappy, both before and after the compaction.
Can someone please help me understand why the directory size changes, and how can I fix it?
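For reference, here is a minimal sketch of the flow I described above; the paths, the measured directory size, and the target file size are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compactor").getOrCreate()

input_dir = "s3://bucket/table/"            # hypothetical input path
output_dir = "s3://bucket/table_compacted/" # hypothetical output path
directory_bytes = 68 * 1024**3              # measured directory size (68 GB)
wanted_file_bytes = 512 * 1024 * 1024       # desired size per output file

# number of output files = directory_size / wanted_file_size
num_files = max(1, directory_bytes // wanted_file_bytes)

df = spark.read.parquet(input_dir)

# repartition(n) shuffles the data into n partitions; each partition is
# then written out as one snappy-compressed Parquet file
(df.repartition(num_files)
   .write.mode("overwrite")
   .option("compression", "snappy")
   .parquet(output_dir))
```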
Upvotes: 1
Views: 342
Reputation: 1
This appears to be an issue with your configuration: it's not distributed enough. Try increasing the number of workers and that should resolve your issue. This is a classic symptom of data skew.
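If you want to try that, a minimal sketch of raising the worker count through Spark configuration might look like this (the values are placeholders to tune for your cluster, not recommendations):

```python
from pyspark.sql import SparkSession

# Hypothetical settings; adjust to your cluster's capacity. These raise
# parallelism, but they do not directly control on-disk output file size.
spark = (SparkSession.builder
         .appName("compactor")
         .config("spark.executor.instances", "20")      # number of workers
         .config("spark.executor.cores", "4")           # cores per worker
         .config("spark.sql.shuffle.partitions", "200") # shuffle parallelism
         .getOrCreate())
```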
Upvotes: 0