Caroline Leite
Caroline Leite

Reputation: 119

Logical with count in Pyspark

I'm new to Pyspark and I have a problem to solve.

I have a dataframe with 4 columns, being customers, person, is_online_store and count:

customer PersonId is_online_store count
afabd2d2 4 true 1
afabd2d2 8 true 2
afabd2d2 3 true 1
afabd2d2 2 false 1
afabd2d2 4 false 1

I need to create according to the following rules:

How it should look:

customer PersonId New_Column
afabd2d2 4 Hybrid
afabd2d2 8 Online
afabd2d2 3 Online
afabd2d2 2 Offline

Upvotes: 1

Views: 481

Answers (1)

blackbishop
blackbishop

Reputation: 32700

Group by customer + personId and use when expression to check if all values in is_online_store column are true / false or a mix of the 2, using for example bool_and function:

from pyspark.sql import functions as F

df1 = df.groupBy("customer", "PersonId").agg(
    F.when(F.expr("bool_and(is_online_store)"), "Online")
     .when(F.expr("bool_and(!is_online_store)"), "Offline")
     .otherwise("Hybrid").alias("New_Column")
)

df1.show()
#+--------+--------+----------+
#|customer|PersonId|New_Column|
#+--------+--------+----------+
#|afabd2d2|       2|   Offline|
#|afabd2d2|       8|    Online|
#|afabd2d2|       4|    Hybrid|
#|afabd2d2|       3|    Online|
#+--------+--------+----------+

Upvotes: 1

Related Questions