Laurynas G

Reputation: 595

Aggregate a column on rows with a condition on another column using groupby

Let's say I have the following PySpark DataFrame:

 Country    Direction    Quantity     Price
 Belgium    In           5            10
 Belgium    Out          2            8
 Belgium    Out          3            9
 France     In           2            3
 France     Out          3            2
 France     Out          4            3
 

Is it possible to group this DataFrame by the "Country" column, aggregating the average of the "Price" column as usual, but applying "first" to the "Quantity" column only for rows where "Direction" is "Out"? I imagine it should be something like this:

df.groupby("Country").agg(F.mean('Price'), F.first(F.col('Quantity').filter(F.col('Direction') == "Out")))

Upvotes: 1

Views: 1074

Answers (1)

mck

Reputation: 42332

You can mask Quantity (replace it with null) wherever Direction != 'Out', then take the first value with ignorenulls=True:

df.groupby("Country").agg(
    F.mean('Price'),
    F.first(
        F.when(
            F.col('Direction') == "Out",
            F.col('Quantity')
        ),
        ignoreNulls=True
    )
)
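
For reference, here is a minimal self-contained sketch of the whole thing (the SparkSession setup and the avg_price / first_out_quantity aliases are illustrative, not part of the original code):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Rebuild the example DataFrame from the question
df = spark.createDataFrame(
    [
        ("Belgium", "In", 5, 10),
        ("Belgium", "Out", 2, 8),
        ("Belgium", "Out", 3, 9),
        ("France", "In", 2, 3),
        ("France", "Out", 3, 2),
        ("France", "Out", 4, 3),
    ],
    ["Country", "Direction", "Quantity", "Price"],
)

result = df.groupby("Country").agg(
    F.mean("Price").alias("avg_price"),
    F.first(
        # Quantity where Direction == "Out", null otherwise
        F.when(F.col("Direction") == "Out", F.col("Quantity")),
        ignorenulls=True,
    ).alias("first_out_quantity"),
)
result.show()
# Expected (given the row order above):
# Belgium: avg_price 9.0,   first_out_quantity 2
# France:  avg_price ~2.67, first_out_quantity 3

One caveat: first is order-sensitive, and Spark does not guarantee row order within a group after a shuffle, so on a repartitioned dataset the picked value may vary unless you impose an explicit ordering first.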

Upvotes: 3
