Laurynas G

Reputation: 595

Aggregate a column on rows with a condition on another column using groupby

Let's say I have the following PySpark DataFrame:

 Country    Direction    Quantity     Price
 Belgium    In           5            10
 Belgium    Out          2            8
 Belgium    Out          3            9
 France     In           2            3
 France     Out          3            2
 France     Out          4            3
 

Is it possible to group this DataFrame by the "Country" column, aggregating the average of the "Price" column as usual, but applying "first" to the "Quantity" column only for rows where "Direction" is "Out"? I imagine it should be something like this:

df.groupby("Country").agg(F.mean('Price'), F.first(F.col('Quantity').filter(F.col('Direction') == "Out")))

Upvotes: 1

Views: 1074

Answers (1)

mck

Reputation: 42332

You can mask Quantity (replace it with null) wherever Direction != 'Out', then take the first value with ignorenulls=True:

df.groupby("Country").agg(
    F.mean('Price'),
    F.first(
        F.when(
            F.col('Direction') == "Out",
            F.col('Quantity')
        ),
        ignoreNulls=True
    )
)
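
For reference, here is a minimal self-contained sketch of the whole thing (the SparkSession setup and the avg_price / first_out_quantity aliases are illustrative, not part of the original code):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Rebuild the example DataFrame from the question
df = spark.createDataFrame(
    [
        ("Belgium", "In", 5, 10),
        ("Belgium", "Out", 2, 8),
        ("Belgium", "Out", 3, 9),
        ("France", "In", 2, 3),
        ("France", "Out", 3, 2),
        ("France", "Out", 4, 3),
    ],
    ["Country", "Direction", "Quantity", "Price"],
)

result = df.groupby("Country").agg(
    F.mean("Price").alias("avg_price"),
    F.first(
        # Quantity where Direction == "Out", null otherwise
        F.when(F.col("Direction") == "Out", F.col("Quantity")),
        ignorenulls=True,
    ).alias("first_out_quantity"),
)
result.show()
# Expected (given the row order above):
# Belgium: avg_price 9.0,   first_out_quantity 2
# France:  avg_price ~2.67, first_out_quantity 3

One caveat: first is order-sensitive, and Spark does not guarantee row order within a group after a shuffle, so on a repartitioned dataset the picked value may vary unless you impose an explicit ordering first.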

Upvotes: 3
