Caroline Leite

Reputation: 119

Conditional logic in PySpark with `when`

I have the dataframe below:

customer_id           person_id          type_person  type_person2  insert_date2  anterior_type  update_date
abcdefghijklmnopqrst  4a5ae8a5-6682-467  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  1be8d3e8-8075-438  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  6912dadc-1692-4bd  Online       Offline       2022-03-02    Online         2022-03-03
abcdefghijklmnopqrst  e48cba37-113c-4bd  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  831cb669-b2ae-4e8  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  69161fe5-62ac-400  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  b48b59a0-92eb-410  Online       Online        2022-03-02    null           null

I need to look at the `type_person` and `type_person2` columns and create a new column with the following rules:

How do I do this?

Upvotes: 0

Views: 101

Answers (1)

Benny Elgazar

Reputation: 361

Use a case/when expression.

You have two options for this:

  1. Using SparkSQL
  2. DataFrame Manipulation. (Ref: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.when.html?highlight=when#pyspark.sql.functions.when)

Let's do it using the 2nd way:

import pyspark.sql.functions as F

(
    DF
    .withColumn(
        'rule_result',
        # Each comparison must be wrapped in parentheses: in Python, & and |
        # bind tighter than ==, so F.col("a") == 'x' & F.col("b") == 'y'
        # raises an error. The string literals match the capitalization in
        # the sample data ('Online'/'Offline').
        F.when((F.col("type_person") == 'Online') & (F.col("type_person2") == 'Online'), 'online')
        # The question's rules were omitted; this branch assumes the
        # Offline/Offline case maps to 'offline' (the original answer
        # duplicated the Offline/Online condition here).
        .when((F.col("type_person") == 'Offline') & (F.col("type_person2") == 'Offline'), 'offline')
        .when((F.col("type_person") == 'Offline') & (F.col("type_person2") == 'Online'), 'hybrid')
        .when((F.col("type_person") == 'Online') & (F.col("type_person2") == 'Offline'), 'hybrid')
        .when((F.col("type_person") == 'Hybrid') | (F.col("type_person2") == 'Hybrid'), 'hybrid')
        .otherwise(None)
    )
)

Upvotes: 1
