foebu

Reputation: 1415

Is it safe to sort before grouping a dataframe in pyspark?

Given a pyspark dataframe df with columns 'ProductId', 'Date' and 'Price', how safe is it to sort by 'Date' and assume that func.first('Price') will always retrieve the Price corresponding to the minimum Date?

I mean: will
df.orderBy('ProductId', 'Date').groupBy('ProductId').agg(func.first('Price'))
return, for each product, the first price paid in time, or can the grouping disturb the ordering established by orderBy?

Upvotes: 2

Views: 2689

Answers (1)

pault

Reputation: 43534

I am not sure whether the order is guaranteed to be maintained through the groupBy(). However, here is an alternative way to do what you want that will work.

Use pyspark.sql.Window to partition and order the DataFrame as desired. Then use pyspark.sql.DataFrame.distinct() to drop the duplicate entries.

For example:

Create Dummy Data

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (123, '2017-07-01', 50),
    (123, '2017-01-01', 100),
    (345, '2018-01-01', 20),
    (123, '2017-03-01', 25),
    (345, '2018-02-01', 33)
]

df = spark.createDataFrame(data, ['ProductId', 'Date', 'Price'])
df.show()
#+---------+----------+-----+
#|ProductId|      Date|Price|
#+---------+----------+-----+
#|      123|2017-07-01|   50|
#|      123|2017-01-01|  100|
#|      345|2018-01-01|   20|
#|      123|2017-03-01|   25|
#|      345|2018-02-01|   33|
#+---------+----------+-----+
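For reference, here is a sketch of the approach from the question run on this data. On a tiny example it may print the expected prices, but since the groupBy() is not documented to preserve the preceding sort, that output is not something to rely on:

import pyspark.sql.functions as f

# order first, then group: first() here is NOT guaranteed to see the minimum Date
df.orderBy('ProductId', 'Date').groupBy('ProductId').agg(
    f.first('Price').alias('Price')
).show()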

Use Window

Use Window.partitionBy('ProductId').orderBy('Date'):

import pyspark.sql.functions as f
from pyspark.sql import Window

df.select(
    'ProductId',
    f.first('Price').over(Window.partitionBy('ProductId').orderBy('Date')).alias('Price')
).distinct().show()
#+---------+-----+
#|ProductId|Price|
#+---------+-----+
#|      123|  100|
#|      345|   20|
#+---------+-----+
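Alternatively (a sketch, assuming the same df as above), you could rank the rows with pyspark.sql.functions.row_number() over the same window and keep only the first row per product, which avoids the distinct():

from pyspark.sql import Window
import pyspark.sql.functions as f

# rank rows within each product by Date, then keep the earliest row
w = Window.partitionBy('ProductId').orderBy('Date')

df.withColumn('rn', f.row_number().over(w))\
    .where(f.col('rn') == 1)\
    .select('ProductId', 'Price')\
    .show()
#+---------+-----+
#|ProductId|Price|
#+---------+-----+
#|      123|  100|
#|      345|   20|
#+---------+-----+

This also makes it explicit that exactly one row per product is kept, even if two rows were to share the same minimum Date.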

Edit

I found this Scala post in which the accepted answer says that the order is preserved, though there is a discussion in the comments that contradicts that.

Upvotes: 1
