Reputation: 595
Let's say I have the following PySpark DataFrame:
Country  Direction  Quantity  Price
Belgium  In         5         10
Belgium  Out        2         8
Belgium  Out        3         9
France   In         2         3
France   Out        3         2
France   Out        4         3
Is it possible to group this dataframe by the "Country" column, aggregating the average of the "Price" column as usual, but taking the "first" of the "Quantity" column only for rows where the "Direction" column is "Out"? I imagine it should be something like this:
df.groupby("Country").agg(F.mean('Price'), F.first(F.col('Quantity').filter(F.col('Direction') == "Out")))
Upvotes: 1
Views: 1074
Reputation: 42332
You can mask Quantity for rows where Direction != 'Out' and take first with ignorenulls=True:
df.groupby("Country").agg(
F.mean('Price'),
F.first(
F.when(
F.col('Direction') == "Out",
F.col('Quantity')
),
ignoreNulls=True
)
)
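For reference, here is a minimal self-contained sketch of the whole round trip, assuming a local SparkSession; the names spark, result, and the output column aliases are illustrative, not from the original post:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question
df = spark.createDataFrame(
    [
        ("Belgium", "In", 5, 10),
        ("Belgium", "Out", 2, 8),
        ("Belgium", "Out", 3, 9),
        ("France", "In", 2, 3),
        ("France", "Out", 3, 2),
        ("France", "Out", 4, 3),
    ],
    ["Country", "Direction", "Quantity", "Price"],
)

result = df.groupby("Country").agg(
    # plain average of Price over the whole group
    F.mean("Price").alias("avg_price"),
    # first non-null Quantity among the 'Out' rows of the group
    F.first(
        F.when(F.col("Direction") == "Out", F.col("Quantity")),
        ignorenulls=True,
    ).alias("first_out_quantity"),
)
result.show()

With this data the result is avg_price 9.0 and first_out_quantity 2 for Belgium, and avg_price ~2.67 and first_out_quantity 3 for France. Keep in mind that first is non-deterministic after a shuffle, so if "first" must mean a specific row order, make that order explicit rather than relying on the order the data was read in.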
Upvotes: 3