Reputation: 55
I am writing my DataFrame as below:
df.write().format("com.databricks.spark.avro").save("path");
However, I am getting around 200 output files, of which around 30-40 are empty. I understand this is probably due to empty partitions, so I updated my code to:
df.coalesce(50).write().format("com.databricks.spark.avro").save("path");
But I feel this might impact performance. Is there a better approach to limit the number of output files and avoid empty files?
Upvotes: 0
Views: 5209
Reputation: 1057
The default number of shuffle partitions is 200 (spark.sql.shuffle.partitions); you have to shuffle the data to remove skewed or empty partitions.
You can either use the repartition method on the DataFrame, or use the DISTRIBUTE BY clause in SQL; both repartition the data while distributing it evenly among partitions.
def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns a new Dataset hash-partitioned by the given expressions into numPartitions partitions.
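A minimal sketch of both approaches, assuming a DataFrame with a column named key (the column name, paths, and partition count of 50 are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-example").getOrCreate()
val df = spark.read.format("com.databricks.spark.avro").load("input-path")

// Hash-partition rows by "key" into 50 partitions, so data is spread
// evenly and no output file stays empty.
val repartitioned = df.repartition(50, df("key"))
repartitioned.write.format("com.databricks.spark.avro").save("output-path")

// The equivalent DISTRIBUTE BY form in SQL:
df.createOrReplaceTempView("t")
val distributed = spark.sql("SELECT * FROM t DISTRIBUTE BY key")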
You may also use repartitionAndSortWithinPartitions on a pair RDD, which sorts records within each partition during the shuffle and can improve the compression ratio.
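A minimal sketch of that RDD-level variant, reusing the df from above (the key extraction and partition count are assumptions for illustration):

import org.apache.spark.HashPartitioner

// Key each row, then repartition and sort within each partition in a
// single shuffle; sorted runs of similar records often compress better.
val pairs = df.rdd.map(row => (row.getAs[String]("key"), row))
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(50))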
Upvotes: 1
Reputation: 11587
repartition your dataframe using this method. To eliminate skew and ensure even distribution of data, choose column(s) with high cardinality (a large number of distinct values) for the partitionExprs argument.
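For example, you can compare the cardinality of candidate columns first, then repartition on the one with the most distinct values (the column names user_id and country are hypothetical):

import org.apache.spark.sql.functions.countDistinct

// Compare distinct counts to pick a high-cardinality partitioning column.
df.select(countDistinct("user_id"), countDistinct("country")).show()

// Hash-partition on the high-cardinality column for an even spread.
df.repartition(50, df("user_id"))
  .write.format("com.databricks.spark.avro").save("path")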
Upvotes: 1
Reputation: 23109
You can remove the empty partitions before writing by using the repartition method.
The default number of shuffle partitions is 200.
A suggested sizing rule is: number of partitions = number of cores * 4
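A sketch of that rule of thumb, using defaultParallelism as a stand-in for the total core count (an assumption; it typically equals the number of cores on most cluster managers):

// defaultParallelism usually reflects the total cores available.
val numPartitions = spark.sparkContext.defaultParallelism * 4

df.repartition(numPartitions)
  .write.format("com.databricks.spark.avro")
  .save("path")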
Upvotes: 1