Why is the order of a written dataframe not kept after reading it?

Question

I am writing a dataframe to HDFS ordering by the first two columns:

final = .select('Pais','Anho','NumPatentes','TotalCitas','MediaCitas','MaxCitas').orderBy("Pais", "Anho")

final.show()
final.write.format("csv").save("", header = 'true')

Then I am reading it from the HDFS using:

a = (spark
     .read
     .option("inferSchema", "true")
     .option("header", "true")
     .csv(""))
a.show()

However, the output of the first show() differs from the second: the first dataframe is ordered and the second is not. These are the first and second dataframes:

+-------------+----+-----------+----------+----------+--------+
|         Pais|Anho|NumPatentes|TotalCitas|MediaCitas|MaxCitas|
+-------------+----+-----------+----------+----------+--------+
|        Italy|1970|          1|         3|       3.0|       3|
|        Italy|1980|          2|         3|       1.5|       2|
|        Italy|1983|          2|         4|       2.0|       2|
|United States|1978|          1|         1|       1.0|       1|
+-------------+----+-----------+----------+----------+--------+
+-------------+----+-----------+----------+----------+--------+
|         Pais|Anho|NumPatentes|TotalCitas|MediaCitas|MaxCitas|
+-------------+----+-----------+----------+----------+--------+
|United States|1978|          1|         1|       1.0|       1|
|        Italy|1980|          2|         3|       1.5|       2|
|        Italy|1970|          1|         3|       3.0|       3|
|        Italy|1983|          2|         4|       2.0|       2|
+-------------+----+-----------+----------+----------+--------+

It seems like the written dataframe is saved unordered. How can I solve this? How can I save it ordered?

Adam Dukkon · Accepted Answer

Saving ordered dataframe in Spark

Do Spark/Parquet partitions maintain ordering?

Based on these answers, you cannot rely on the order being maintained when files are written: the partitions are written and read separately, and Spark only guarantees sorting inside a partition.
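Since sorting is only guaranteed inside a partition, two workarounds follow from that. This is a minimal sketch, not a definitive fix: the helper names write_sorted_single_file and read_sorted, the path argument, and the SORT_COLS default are illustrative assumptions. The first helper forces a single partition and sorts inside it, which only suits data small enough for one task. The second re-applies the sort after reading, which does not depend on how the files were laid out.

```python
# Sketch only: helper names, the path argument, and the default sort columns
# are illustrative assumptions, not code from the question.

SORT_COLS = ("Pais", "Anho")


def write_sorted_single_file(df, path, sort_cols=SORT_COLS):
    """Write df as CSV from a single partition sorted by sort_cols.

    coalesce(1) collapses the data into one partition and
    sortWithinPartitions sorts the rows inside it, so the single part
    file is written in sorted order. Only suitable for small data.
    """
    (df.coalesce(1)
       .sortWithinPartitions(*sort_cols)
       .write
       .format("csv")
       .option("header", "true")
       .save(path))


def read_sorted(spark, path, sort_cols=SORT_COLS):
    """Read the CSV back and sort explicitly instead of relying on file order."""
    return (spark.read
                 .option("inferSchema", "true")
                 .option("header", "true")
                 .csv(path)
                 .orderBy(*sort_cols))
```

With an existing SparkSession named spark and the dataframe from the question, usage would look like write_sorted_single_file(final, some_path) followed by read_sorted(spark, some_path).show(). Sorting after the read is the more robust of the two, because the order of rows read from files is not something Spark promises in general.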
