Reputation: 401
I am writing a dataframe to HDFS ordering by the first two columns:
final = <dataframe>.select('Pais','Anho','NumPatentes','TotalCitas','MediaCitas','MaxCitas').orderBy("Pais", "Anho")
final.show()
final.write.format("csv").save("<path>", header = 'true')
Then I am reading it from the HDFS using:
a = (spark
     .read
     .option("inferSchema", "true")
     .option("header", "true")
     .csv("<path>"))
# show() returns None, so keep the DataFrame in `a` and display it separately
a.show()
However, the output of the first show() does not match the second one: the first dataframe shown is ordered and the second one is not. These are the first and the second dataframes:
+-------------+----+-----------+----------+----------+--------+
|         Pais|Anho|NumPatentes|TotalCitas|MediaCitas|MaxCitas|
+-------------+----+-----------+----------+----------+--------+
|        Italy|1970|          1|         3|       3.0|       3|
|        Italy|1980|          2|         3|       1.5|       2|
|        Italy|1983|          2|         4|       2.0|       2|
|United States|1978|          1|         1|       1.0|       1|
+-------------+----+-----------+----------+----------+--------+
+-------------+----+-----------+----------+----------+--------+
|         Pais|Anho|NumPatentes|TotalCitas|MediaCitas|MaxCitas|
+-------------+----+-----------+----------+----------+--------+
|United States|1978|          1|         1|       1.0|       1|
|        Italy|1980|          2|         3|       1.5|       2|
|        Italy|1970|          1|         3|       3.0|       3|
|        Italy|1983|          2|         4|       2.0|       2|
+-------------+----+-----------+----------+----------+--------+
It seems like the written dataframe is saved unordered. How can I solve this? How can I save it ordered?
Upvotes: 2
Views: 2760
Reputation: 293
Saving ordered dataframe in Spark
Do Spark/Parquet partitions maintain ordering?
Based on these answers, you cannot rely on the order being maintained when the files are written: the partitions are written and read separately, and Spark only guarantees the sort order inside a partition.
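If you need ordered rows anyway, here is a minimal sketch of two common workarounds. The helper names `write_single_sorted_file` and `read_sorted` are hypothetical (not part of Spark); `df` is assumed to be your dataframe and `spark` an existing SparkSession. The first helper forces a single partition and sorts inside it, so only one part file is written. The second one simply re-applies the ordering after reading, which is the only approach that does not depend on how the files are laid out.

```python
def write_single_sorted_file(df, path, sort_cols):
    """Write df as one CSV part file whose rows are sorted by sort_cols.

    Assumption: the data is small enough to fit in a single partition.
    """
    (df
     .repartition(1)                    # collapse everything into one partition
     .sortWithinPartitions(*sort_cols)  # sort the rows inside that partition
     .write
     .option("header", "true")
     .csv(path))


def read_sorted(spark, path, sort_cols):
    """Read the CSV back and re-apply the ordering explicitly."""
    return (spark.read
            .option("inferSchema", "true")
            .option("header", "true")
            .csv(path)
            .orderBy(*sort_cols))


# Example usage (assumes `final` and `spark` from the question):
# write_single_sorted_file(final, "<path>", ["Pais", "Anho"])
# read_sorted(spark, "<path>", ["Pais", "Anho"]).show()
```

Even with a single sorted file, I would still call orderBy after reading whenever the order matters, since the read side makes no ordering promise.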
Upvotes: 3