Sergio6Rey

Reputation: 401

Why is the order of a written dataframe not kept after reading it?

I am writing a dataframe to HDFS, ordered by the first two columns:

final = <dataframe>.select('Pais','Anho','NumPatentes','TotalCitas','MediaCitas','MaxCitas').orderBy("Pais", "Anho")

final.show()
final.write.format("csv").save("<path>", header = 'true')

Then I am reading it back from HDFS using:

# Keep the dataframe in `a`; show() returns None, so it is called separately
a = (spark
.read
.option("inferSchema", "true")
.option("header", "true")
.csv("<path>"))
a.show()

However, the output of the first show() differs from the second one: the first dataframe shown is ordered and the second one is not. These are the first and the second dataframes:

+-------------+----+-----------+----------+----------+--------+
|         Pais|Anho|NumPatentes|TotalCitas|MediaCitas|MaxCitas|
+-------------+----+-----------+----------+----------+--------+
|        Italy|1970|          1|         3|       3.0|       3|
|        Italy|1980|          2|         3|       1.5|       2|
|        Italy|1983|          2|         4|       2.0|       2|
|United States|1978|          1|         1|       1.0|       1|
+-------------+----+-----------+----------+----------+--------+
+-------------+----+-----------+----------+----------+--------+
|         Pais|Anho|NumPatentes|TotalCitas|MediaCitas|MaxCitas|
+-------------+----+-----------+----------+----------+--------+
|United States|1978|          1|         1|       1.0|       1|
|        Italy|1980|          2|         3|       1.5|       2|
|        Italy|1970|          1|         3|       3.0|       3|
|        Italy|1983|          2|         4|       2.0|       2|
+-------------+----+-----------+----------+----------+--------+

It seems like the written dataframe is saved unordered. How can I solve this? How can I save it ordered?

Upvotes: 2

Views: 2760

Answers (1)

Adam Dukkon

Reputation: 293

Saving ordered dataframe in Spark

Do Spark/Parquet partitions maintain ordering?

Based on the answers to the two questions above, the order is not guaranteed to survive the round trip: each partition is written to its own file and the files are read back separately, and Spark only guarantees the sort order within a partition.

Upvotes: 3
