Reputation: 55
I am writing my DataFrame as below:
df.write().format("com.databricks.spark.avro").save("path");
However, I am getting around 200 output files, of which around 30-40 are empty. I understand this is probably due to empty partitions, so I updated my code to:
df.coalesce(50).write().format("com.databricks.spark.avro").save("path");
But I feel this might impact performance. Is there a better approach to limit the number of output files and avoid empty files?
Upvotes: 0
Views: 5209
Reputation: 1057
The default number of shuffle partitions is 200 (spark.sql.shuffle.partitions); you have to shuffle the data to remove skewed or empty partitions.
You can either use the repartition method on the DataFrame, or use the DISTRIBUTE BY clause in SQL; both repartition the data while distributing it evenly among partitions.
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns a new Dataset hash-partitioned by the given expressions into numPartitions partitions.
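A minimal sketch of both approaches, assuming a DataFrame with a column named key (the column name, paths, and partition count of 50 are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-example").getOrCreate()
val df = spark.read.format("com.databricks.spark.avro").load("input-path")

// Hash-partition rows by "key" into 50 partitions, so data is spread
// evenly and no output file stays empty.
val repartitioned = df.repartition(50, df("key"))
repartitioned.write.format("com.databricks.spark.avro").save("output-path")

// The equivalent DISTRIBUTE BY form in SQL:
df.createOrReplaceTempView("t")
val distributed = spark.sql("SELECT * FROM t DISTRIBUTE BY key")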
You may also use repartitionAndSortWithinPartitions on a pair RDD, which sorts records within each partition during the shuffle and can improve the compression ratio.
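A minimal sketch of that RDD-level variant, reusing the df from above (the key extraction and partition count are assumptions for illustration):

import org.apache.spark.HashPartitioner

// Key each row, then repartition and sort within each partition in a
// single shuffle; sorted runs of similar records often compress better.
val pairs = df.rdd.map(row => (row.getAs[String]("key"), row))
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(50))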
Upvotes: 1
Reputation: 11587
repartition your dataframe using this method. To eliminate skew and ensure even distribution of data, choose column(s) with high cardinality (a large number of distinct values) for the partitionExprs argument.
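For example, you can compare the cardinality of candidate columns first, then repartition on the one with the most distinct values (the column names user_id and country are hypothetical):

import org.apache.spark.sql.functions.countDistinct

// Compare distinct counts to pick a high-cardinality partitioning column.
df.select(countDistinct("user_id"), countDistinct("country")).show()

// Hash-partition on the high-cardinality column for an even spread.
df.repartition(50, df("user_id"))
  .write.format("com.databricks.spark.avro").save("path")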
Upvotes: 1
Reputation: 23109
You can remove the empty partitions before writing by using the repartition method.
The default number of shuffle partitions is 200.
A suggested sizing rule is: number of partitions = number of cores * 4
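A sketch of that rule of thumb, using defaultParallelism as a stand-in for the total core count (an assumption; it typically equals the number of cores on most cluster managers):

// defaultParallelism usually reflects the total cores available.
val numPartitions = spark.sparkContext.defaultParallelism * 4

df.repartition(numPartitions)
  .write.format("com.databricks.spark.avro")
  .save("path")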
Upvotes: 1