MetallicPriest

Reputation: 30825

Does Dataframe coalesce in Spark preserve order?

I use coalesce(1) to write a DataFrame to a single file, like this:

df.coalesce(1).write.format("csv")
  .option("header", true).mode("overwrite").save(output_path)

A quick glance at the file shows that the order was preserved, but is it always the case? If the order is not preserved, how can I enforce it? The coalesce function of RDD has an extra parameter to disallow shuffling, but the coalesce method of Dataframe only takes 1 parameter.

Upvotes: 2

Views: 2150

Answers (1)

QuickSilver

Reputation: 4045

If you read a file (spark.read.text, or sc.textFile for an RDD), the lines of the DataFrame/Dataset/RDD will be in the order that they were in the file.

list, map, filter, coalesce and flatMap do preserve the order. sortBy, partitionBy and join do not preserve the order.

The reason is that most DataFrame/Dataset/RDD operations work on Iterators inside the partitions, processing each partition's elements one after another. So map or filter simply have no way to mess up the order.
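To illustrate the iterator argument, here is a toy model in plain Scala, not Spark itself. Each inner Vector stands in for one partition; the names PartitionOrderSketch, mapPartitions and coalesce are made up for this sketch. It assumes a shuffle-free coalesce merges neighbouring partitions in index order, which is a simplification: Spark's real coalescer also looks at data locality when grouping partitions.

```scala
// Toy model only: each inner Vector plays the role of one partition.
object PartitionOrderSketch {
  type Partitions[A] = Vector[Vector[A]]

  // Narrow operations walk each partition's iterator from front to back,
  // so the relative order inside a partition cannot change.
  def mapPartitions[A, B](parts: Partitions[A])(f: Iterator[A] => Iterator[B]): Partitions[B] =
    parts.map(p => f(p.iterator).toVector)

  // Assumed shuffle-free coalesce: concatenate neighbouring partitions.
  def coalesce[A](parts: Partitions[A], n: Int): Partitions[A] = {
    val groupSize = math.max(1, math.ceil(parts.size.toDouble / n).toInt)
    parts.grouped(groupSize).map(_.flatten).toVector
  }

  val input: Partitions[Int] = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6, 7, 8))

  // Keep even numbers, then scale them; both run per partition.
  val filtered: Partitions[Int] = mapPartitions(input)(_.filter(_ % 2 == 0))
  val mapped: Partitions[Int] = mapPartitions(filtered)(_.map(_ * 10))

  // Vector(Vector(20, 40, 60, 80)): the original ordering survives.
  val single: Partitions[Int] = coalesce(mapped, 1)
}
```

If you need a guaranteed order in the output file rather than an observed one, sorting explicitly on a column before writing is the safer route.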

If you partition with a HashPartitioner and then invoke map on the DataFrame/Dataset/RDD, the keys may change, so the existing partitioning no longer holds. In this case you can use partitionBy to restore the partitioning with a shuffle.
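A minimal sketch of that last point using the RDD API, assuming Spark is on the classpath and a local master is acceptable. The object name PartitionerSketch, the sample data and the partition count of 2 are arbitrary choices for illustration.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object PartitionerSketch {
  // Returns whether a partitioner is set after each of the three steps.
  def run(): (Boolean, Boolean, Boolean) = {
    // Local session purely for illustration.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partitioner-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))

    // Hash-partition by key; the resulting RDD remembers its partitioner.
    val byKey = pairs.partitionBy(new HashPartitioner(2))
    val afterPartitionBy = byKey.partitioner.isDefined

    // map may change the key, so Spark drops the partitioner.
    val rekeyed = byKey.map { case (k, v) => (k + 1, v) }
    val afterMap = rekeyed.partitioner.isDefined

    // partitionBy shuffles the data and restores hash partitioning.
    val restored = rekeyed.partitionBy(new HashPartitioner(2))
    val afterRestore = restored.partitioner.isDefined

    spark.stop()
    (afterPartitionBy, afterMap, afterRestore)
  }
}
```

Calling PartitionerSketch.run() should report the partitioner as present, then absent, then present again. Note that the restoring partitionBy is a shuffle, so it does not keep the previous row order. When only the values change, mapValues keeps the partitioner and avoids the extra shuffle.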

Upvotes: 2
