Reputation: 30825
I use coalesce(1) to write a DataFrame to a single file, like this:
df.coalesce(1).write.format("csv")
.option("header", true).mode("overwrite").save(output_path)
A quick glance at the file shows that the order was preserved, but is that always the case? If the order is not preserved, how can I enforce it? The coalesce function of RDD has an extra parameter to disallow shuffling, but the coalesce method of DataFrame takes only one parameter.
Upvotes: 2
Views: 2150
Reputation: 4045
If you read a file (spark.read.text or sc.textFile), the lines of the DataFrame/Dataset/RDD will be in the same order as they were in the file.
list, map, filter, coalesce and flatMap do preserve the order.
sortBy, partitionBy and join do not preserve the order.
The reason is that most DataFrame/Dataset/RDD operations work on Iterators inside the partitions, so map or filter simply has no way to mess up the order.
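To make the iterator argument concrete, here is a toy model in plain Scala with no Spark dependency. It is only a sketch: `PartitionOrderModel`, `mapPartitions` and `coalesceToOne` are hypothetical names, and the model assumes that merging into one partition concatenates the partitions in index order, which is what this answer claims but not something the Spark documentation explicitly promises.

```scala
object PartitionOrderModel {
  // Hypothetical stand-in for a partitioned dataset: each inner Vector is one partition.
  type Partitions[A] = Vector[Vector[A]]

  // Apply a transformation to each partition's iterator, the way a narrow
  // operation such as map or filter would. Elements never leave their partition.
  def mapPartitions[A, B](data: Partitions[A])(f: Iterator[A] => Iterator[B]): Partitions[B] =
    data.map(partition => f(partition.iterator).toVector)

  // Assumption of this model: merging into a single partition simply
  // concatenates the existing partitions in index order, without a shuffle.
  def coalesceToOne[A](data: Partitions[A]): Vector[A] = data.flatten

  def main(args: Array[String]): Unit = {
    val lines: Partitions[String] =
      Vector(Vector("a", "b"), Vector("c", "d"), Vector("e"))

    // filter and map consume each iterator front to back, so relative order survives.
    val transformed =
      mapPartitions(lines)(it => it.filter(_ != "c").map(_.toUpperCase))

    println(coalesceToOne(transformed)) // prints Vector(A, B, D, E)
  }
}
```

Because every step either walks an iterator in order or concatenates whole partitions, nothing in this model can reorder elements. A sort or a hash-based repartition would not fit this model, which matches the list of order-breaking operations above.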
If you partition with a HashPartitioner and then invoke map on the DataFrame/Dataset/RDD, the keys may change, so the existing partitioning no longer holds. In this case you can use partitionBy to restore the partitioning with a shuffle.
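A minimal sketch of that last point with the RDD API, assuming Spark is on the classpath and a local master is acceptable. The object name, the sample data and the partition count are made up for illustration.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object RestorePartitioning {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session is fine for this illustration.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("restore-partitioning")
      .getOrCreate()
    val sc = spark.sparkContext

    val partitioner = new HashPartitioner(2)

    // Hash-partition a small pair RDD by key; the RDD now records its partitioner.
    val partitioned = sc
      .parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
      .partitionBy(partitioner)
    println(s"after partitionBy: ${partitioned.partitioner.isDefined}")

    // map may change the keys, so the resulting RDD carries no partitioner.
    val rekeyed = partitioned.map { case (k, v) => (k + 1, v) }
    println(s"after map: ${rekeyed.partitioner.isDefined}")

    // partitionBy shuffles the data again and restores hash partitioning.
    val restored = rekeyed.partitionBy(partitioner)
    println(s"after second partitionBy: ${restored.partitioner.isDefined}")

    spark.stop()
  }
}
```

If only the values need to change, mapValues keeps the keys untouched and therefore keeps the partitioner, which avoids the extra shuffle.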
Upvotes: 2