JavaPlanet

Reputation: 83

Spark DataFrame order preservation: does calling the save operation on an orderBy DataFrame preserve ordering?

I ran some test cases from a Spark shell. The statements that I executed were of the form:

read.orderBy($"p_int".asc).write.format("com.databricks.spark.csv").save("file:///tmp/output.txt")

The content in the output directory always seems to be sorted. However, I cannot find any Spark documentation that describes guarantees provided by the DataFrameWriter about preserving partition order or row order.

The question is: can I always expect the data in the target file to be sorted? Please add a link to the relevant documentation.

Upvotes: 0

Views: 855

Answers (1)

nikos

Reputation: 147

If you coalesce to 1 partition before saving, the output will be sorted. Be careful, though: when reading the .csv back into Spark, the ordering will be lost if spark.default.parallelism in your Spark config is greater than 1.
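
A minimal sketch of the coalesce approach, assuming Spark 2.x or later with the built-in CSV writer (the question uses the older com.databricks.spark.csv format). The sample data, the app name, and the output path /tmp/output_sorted are made up for illustration. This shows the behaviour observed in practice, not a guarantee documented by Spark.

```scala
import org.apache.spark.sql.SparkSession

object SortedCsvSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("sorted-csv-sketch") // hypothetical name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with the column name from the question.
    val df = Seq(3, 1, 2).toDF("p_int")

    df.orderBy($"p_int".asc)
      .coalesce(1)               // single partition, so a single part file is written
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("file:///tmp/output_sorted") // hypothetical output directory

    spark.stop()
  }
}
```

If the order matters after reading the file back, applying orderBy again on the loaded DataFrame avoids relying on how the reader splits the input.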

Upvotes: 1
