Reputation: 87

Order of rows shown changes on selection of columns from dependent pyspark dataframe

Why does the order of rows displayed by show differ when I select only a subset of the dataframe's columns?

Here is the original dataframe:

As show displays, the dates appear in the given order.

However, the order of rows displayed by show changes when I create a new dataframe by selecting a subset of the columns of predict_df.

Upvotes: 4

Answers (3)

Ged

Reputation: 18098

This happens because show is an action, and it is called twice.

As no .cache is applied, the whole computation starts again from the beginning for each action. I tried this a few times and sometimes got the same order and sometimes a different one, as the questioner observed. Processing is non-deterministic.

As soon as I used .cache, I always got the same result.

This means that ordering is preserved over a narrow transformation on a dataframe if caching has been applied; otherwise the second action triggers processing from the start again. The bottom line is probably this: always order explicitly if ordering matters.

Upvotes: 1

Ihor Konovalenko

Reputation: 1407

Because a Spark dataframe is itself unordered. This follows from the parallel processing principles that Spark uses. Different records may be located in different files (and on different nodes), and different executors may read the data at different times and in a different sequence.

So you have to explicitly specify the order before the Spark action, using the orderBy (or sort) method. E.g.:

df.orderBy('date').show()

In this case the result will be ordered by the date column and will be more predictable. But if many records have the same date value, the records within each such date subset will still be unordered. So, in order to obtain strictly ordered data, we have to perform orderBy on a set of columns whose combined values are unique for every row. E.g.:

from pyspark.sql.functions import col
df.orderBy(col("date").asc(), col("other_column").desc())

In general, unordered datasets are the normal case for data processing systems. Even "traditional" DBMSs like PostgreSQL or MS SQL Server return unordered records in general, and we have to explicitly use an ORDER BY clause in the SELECT statement. Even if we sometimes see the same results from a query, the DBMS does not guarantee that another execution will return the same result, especially when a large amount of data is read.

Upvotes: 4

pltc

Reputation: 6082

As @Ihor Konovalenko and @mck mentioned, a Spark dataframe is unordered by nature. Also, it looks like your dataframe doesn't have a reliable key to order by, so one solution is to use monotonically_increasing_id (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.monotonically_increasing_id.html) to create an id column that you can then sort by to keep a consistent order. However, if your dataframe is big, be aware that this function might take some time to generate an id for each row.

Upvotes: 0
