Reputation: 1415
Given a PySpark DataFrame df
with columns 'ProductId', 'Date' and 'Price', how safe is it to sort by 'Date' and assume that func.first('Price')
will always retrieve the Price corresponding to the minimum date?
I mean: will
df.orderBy('ProductId', 'Date').groupBy('ProductId').agg(func.first('Price'))
return the first price paid in time for each product, or could the grouping disturb the order
established by the orderBy?
Upvotes: 2
Views: 2689
Reputation: 43534
I am not sure whether the order is guaranteed to be maintained for the groupBy(). However, here is an alternative way to do what you want that will work.
Use pyspark.sql.Window to partition and order the DataFrame as desired. Then use pyspark.sql.DataFrame.distinct() to drop the duplicate entries.
For example:
Create Dummy Data
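# assuming sqlCtx is an existing SQLContext; on Spark 2+, spark.createDataFrame works the same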
data = [
    (123, '2017-07-01', 50),
    (123, '2017-01-01', 100),
    (345, '2018-01-01', 20),
    (123, '2017-03-01', 25),
    (345, '2018-02-01', 33)
]
df = sqlCtx.createDataFrame(data, ['ProductId', 'Date', 'Price'])
df.show()
#+---------+----------+-----+
#|ProductId|      Date|Price|
#+---------+----------+-----+
#|      123|2017-07-01|   50|
#|      123|2017-01-01|  100|
#|      345|2018-01-01|   20|
#|      123|2017-03-01|   25|
#|      345|2018-02-01|   33|
#+---------+----------+-----+
Use Window
Use Window.partitionBy('ProductId').orderBy('Date'):
import pyspark.sql.functions as f
from pyspark.sql import Window
df.select(
    'ProductId',
    f.first('Price').over(Window.partitionBy('ProductId').orderBy('Date')).alias('Price')
).distinct().show()
#+---------+-----+
#|ProductId|Price|
#+---------+-----+
#|      123|  100|
#|      345|   20|
#+---------+-----+
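If you would rather avoid the distinct() call, an equivalent pattern along the same lines (a sketch, not part of the original answer) is to rank the rows with row_number() and keep only the first row per product:
w = Window.partitionBy('ProductId').orderBy('Date')

# row_number() assigns 1 to the earliest Date in each partition;
# keeping only those rows leaves exactly one (first) Price per product
df.withColumn('rn', f.row_number().over(w)) \
    .where(f.col('rn') == 1) \
    .select('ProductId', 'Price') \
    .show()
#+---------+-----+
#|ProductId|Price|
#+---------+-----+
#|      123|  100|
#|      345|   20|
#+---------+-----+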
Edit
I found this Scala post, in which the accepted answer says that the order is preserved, though there is a discussion in the comments that contradicts that.
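If you want to sidestep the ordering question entirely, one order-independent alternative (a sketch, not from that post) is to aggregate on a struct, since structs compare field by field and the earliest Date therefore wins:
# min(struct(Date, Price)) keeps the row with the smallest Date,
# so no ordering guarantee from groupBy is needed at all
df.groupBy('ProductId').agg(
    f.min(f.struct('Date', 'Price')).alias('first')
).select(
    'ProductId',
    f.col('first.Price').alias('Price')
).show()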
Upvotes: 1