Reputation: 83
I ran some test cases from a Spark shell. The statements I executed were of the form:
read.orderBy($"p_int".asc).write.format("com.databricks.spark.csv").save("file:///tmp/output.txt")
The content in the output directory always seems to be sorted. However, I cannot find any Spark documentation that describes guarantees provided by the DataFrameWriter about preserving partition order or row order.
My question: can I always expect the data in the target file to be sorted? Please include a link to the relevant documentation.
Upvotes: 0
Views: 855
Reputation: 147
If you coalesce to 1 partition before saving, the output will be sorted. Be careful, though: when reading the .csv back into Spark, the ordering can be lost if spark.default.parallelism in your Spark config is more than 1.
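As a rough sketch of the above, the following sorts, collapses to a single partition, writes, and then re-sorts after reading back rather than relying on file order. It assumes Spark 2.x or later (so `SparkSession` and the built-in `csv` source are available); the sample data, the output path, and the `overwrite` mode are placeholders for illustration and are not from the question. This reflects behavior seen in practice, not a guarantee I can point to in the documentation.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder()
  .appName("sorted-csv-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the question's `read` DataFrame.
val df = Seq(3, 1, 2).toDF("p_int")

// Hypothetical output location; save() creates a directory of part files here.
val outputPath = "file:///tmp/sorted_output"

df.orderBy($"p_int".asc)
  .coalesce(1)        // one partition, so a single part file is written
  .write
  .mode("overwrite")  // added only so the sketch can be rerun
  .format("csv")
  .save(outputPath)

// When reading back, sort again explicitly instead of depending on row order.
val readBack = spark.read
  .format("csv")
  .load(outputPath)                 // no header was written, so the column is _c0
  .select($"_c0".cast("int").as("p_int"))
  .orderBy($"p_int".asc)

readBack.show()
```

If downstream code needs a particular order, applying `orderBy` at read time as above is the safer choice, since it does not depend on how the file is split into partitions.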
Upvotes: 1