Do parquet files preserve the row order of Spark DataFrames?

Question

When I save a Spark DataFrame as a parquet file and then read it back, the rows of the resulting DataFrame are not in the same order as in the original, as shown in the session below. Is this a "feature" of DataFrames or of parquet files? What would be the best way to save a DataFrame in a row-order-preserving manner?

>>> import numpy as np
>>> import pandas as pd
>>> pdf = pd.DataFrame(np.random.random((10,2)))
>>> pdf
          0         1
0  0.191519  0.622109
1  0.437728  0.785359
2  0.779976  0.272593
3  0.276464  0.801872
4  0.958139  0.875933
5  0.357817  0.500995
6  0.683463  0.712702
7  0.370251  0.561196
8  0.503083  0.013768
9  0.772827  0.882641
>>> df = sqlContext.createDataFrame(pdf)
>>> df.show()
+-------------------+--------------------+
|                  0|                   1|
+-------------------+--------------------+
| 0.1915194503788923|  0.6221087710398319|
| 0.4377277390071145|  0.7853585837137692|
| 0.7799758081188035|  0.2725926052826416|
| 0.2764642551430967|  0.8018721775350193|
| 0.9581393536837052|  0.8759326347420947|
|0.35781726995786667|  0.5009951255234587|
| 0.6834629351721363|  0.7127020269829002|
|0.37025075479039493|  0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
|  0.772826621612374|  0.8826411906361166|
+-------------------+--------------------+
>>> df.write.parquet('test.parquet')
>>> df2 = sqlContext.read.parquet('test.parquet')
>>> df2.show()
+-------------------+--------------------+
|                  0|                   1|
+-------------------+--------------------+
| 0.6834629351721363|  0.7127020269829002|
|0.37025075479039493|  0.5611961860656249|
| 0.5030831653078097|0.013768449590682241|
|  0.772826621612374|  0.8826411906361166|
| 0.7799758081188035|  0.2725926052826416|
| 0.2764642551430967|  0.8018721775350193|
| 0.1915194503788923|  0.6221087710398319|
| 0.4377277390071145|  0.7853585837137692|
| 0.9581393536837052|  0.8759326347420947|
|0.35781726995786667|  0.5009951255234587|
+-------------------+--------------------+

Rohan Aletty · Accepted Answer

This looks like the result of partitioning within Spark, combined with how show() is implemented. The function show() essentially wraps some pretty formatting around a call to take(), and there is a good explanation of how take() works here. Since the partitions that are read first may not be the same across both calls to show(), you will see the rows come back in a different order.
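To make the explanation above concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark itself: partitions are plain lists, and the helper names (split_into_partitions, take), the row_id column and the reordering of partitions on read are all assumptions made for illustration. It shows how scanning the same partitions in a different order changes what a take()-style call returns, and how an explicit index column lets you recover the original order.

```python
# Toy model only: partitions are plain lists, not real Spark partitions.
rows = [(i, round(0.1 * i, 1)) for i in range(10)]  # (row_id, value)


def split_into_partitions(data, num_partitions):
    """Split rows into contiguous chunks, loosely like a partitioned dataset."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def take(partitions, n):
    """Collect the first n rows by scanning partitions in the order given."""
    out = []
    for part in partitions:
        for row in part:
            if len(out) == n:
                return out
            out.append(row)
    return out


written = split_into_partitions(rows, 3)

# Assumption for illustration: on read, the partitions arrive in another order.
read_back = [written[1], written[2], written[0]]

shown_before = take(written, 10)   # same order as the original rows
shown_after = take(read_back, 10)  # same rows, different order

# Sorting on the explicit row_id column restores the original order.
restored = sorted(shown_after, key=lambda r: r[0])
```

In PySpark, the analogous approach would presumably be to add an explicit index column to the DataFrame before writing it and to call orderBy() on that column after reading it back, rather than relying on the order in which partitions happen to be read.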
