cisco306

Reputation: 28

PySpark: optimize partitioning by file size and Parquet compression strategy

I have a PySpark DataFrame that is the result of preprocessing and ETL. I can estimate the current size of the DataFrame with the following code:

size_estimator = spark._jvm.org.apache.spark.util.SizeEstimator
df_size = size_estimator.estimate(df_current._jdf)
# Write at most ~128 MB per file to avoid the small-file problem
num_file = max(df_size // (128 * 1024 * 1024) + 1, 1)
df_current.repartition(num_file).write.mode("overwrite").insertInto("schema.table_target")

This approach, however, has a drawback. Since my target table, schema.table_target, is stored as Parquet with Snappy compression, the actual file sizes on HDFS end up significantly smaller than the in-memory estimate, which can still lead to small-file issues.

For instance, if my DataFrame has an estimated size of 130 MB, my current method splits it into two files when writing to HDFS. However, because of the Snappy compression inside the Parquet files, the actual sizes on HDFS might be only 32 MB and 1 MB. Ideally, I would expect a single file of approximately 33 MB in this case.
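
One adjustment I have considered is dividing the in-memory estimate by an assumed compression ratio before computing the number of files. The sketch below does this with a placeholder COMPRESSION_RATIO of 4 (roughly what the 130 MB -> 33 MB example above implies); that value is just an assumption and would have to be measured for the real data:

# Minimal sketch: scale the in-memory estimate by an assumed compression ratio.
# COMPRESSION_RATIO = 4 is a placeholder, not a value measured on my data.
TARGET_FILE_SIZE = 128 * 1024 * 1024  # target ~128 MB per file on HDFS
COMPRESSION_RATIO = 4                 # assumed in-memory -> Snappy Parquet ratio

size_estimator = spark._jvm.org.apache.spark.util.SizeEstimator
df_size = size_estimator.estimate(df_current._jdf)

# Estimate the on-disk Parquet size first, then derive the file count from it,
# so a 130 MB in-memory DataFrame (~33 MB as Parquet) yields a single file.
estimated_parquet_size = df_size // COMPRESSION_RATIO
num_file = max(estimated_parquet_size // TARGET_FILE_SIZE + 1, 1)

df_current.repartition(num_file).write.mode("overwrite").insertInto("schema.table_target")

The obvious drawback is that a hard-coded ratio will drift as the data changes, so I am looking for something more reliable.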

Is there any solution to this problem?

Thanks a lot.

Upvotes: 0

Views: 41

Answers (0)
