PySpark: read multiple CSV files into a dataframe in order

Question

When I read a folder containing multiple CSV files into a dataframe with pyspark (2.2.1), the records come back in an unexpected order. The data folder is created by another Spark program, and the files look like

/path/part-00000-*
/path/part-00001-*
......

and each file contains only one record. Some of the records have null values in some columns.

Records should be ordered by one column, and I'm sure that the files are in the right order, i.e., part-00000-* contains the first record, part-00001-* contains the second record...

However, when I read the data into a dataframe with pyspark:

from pyspark.sql import SQLContext

# Parentheses let the method chain span several lines
df = (SQLContext(sc).read.format('csv')
      .option('header', 'true')
      .option('mode', 'DROPMALFORMED')
      .load('/path'))

the order has changed (the data should be ordered by the timestamp column). I notice that the records at the top have no null values:

+--------------------+-----------+--------------+--------------+
|                time|  timestamp|         attr1|         attr2|
+--------------------+-----------+--------------+--------------+
|2018-09-30 21:33:...| 1538314433| 1538314433000| 1538314433000|
|2018-09-30 21:35:...| 1538314544| 1538314544000| 1538314544000|
|2018-09-30 21:38:...| 1538314682| 1538314682000| 1538314682000|
|2018-09-30 21:38:...| 1538314734| 1538314734000| 1538314734000|
|2018-09-30 21:25:...| 1538313912|          null| 1538313912000|
|2018-09-30 21:25:...| 1538313913|          null| 1538313913000|
|2018-09-30 21:25:...| 1538313914|          null| 1538313914000|
|2018-09-30 21:25:...| 1538313915|          null| 1538313915000|
|2018-09-30 21:25:...| 1538313932|          null| 1538313932000|
|2018-09-30 21:25:...| 1538313934| 1538313934000|          null|
|2018-09-30 21:25:...| 1538313942|          null| 1538313942000|
|2018-09-30 21:25:...| 1538313943|          null| 1538313943000|
|2018-09-30 21:26:...| 1538314007|          null| 1538314007000|
|2018-09-30 21:27:...| 1538314026| 1538314026000|          null|
|2018-09-30 21:27:...| 1538314028|          null| 1538314028000|
|2018-09-30 21:27:...| 1538314029|          null| 1538314029000|
|2018-09-30 21:27:...| 1538314043| 1538314043000|          null|
|2018-09-30 21:27:...| 1538314064| 1538314064000|          null|
|2018-09-30 21:27:...| 1538314067| 1538314067000|          null|

I'm wondering why this happens and how I can load the dataframe in the right order.

Steven · Accepted Answer

Spark does not guarantee any row order when it loads multiple files, so if you want the rows ordered by timestamp, add an orderBy clause:

df.orderBy('timestamp').show()
