Caroline Leite

Reputation: 119

Conditional logic in PySpark with `when`

I have the dataframe below:

customer_id           person_id          type_person  type_person2  insert_date2  anterior_type  update_date
abcdefghijklmnopqrst  4a5ae8a5-6682-467  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  1be8d3e8-8075-438  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  6912dadc-1692-4bd  Online       Offline       2022-03-02    Online         2022-03-03
abcdefghijklmnopqrst  e48cba37-113c-4bd  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  831cb669-b2ae-4e8  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  69161fe5-62ac-400  Online       Online        2022-03-02    null           null
abcdefghijklmnopqrst  b48b59a0-92eb-410  Online       Online        2022-03-02    null           null

I need to look at the `type_person` and `type_person2` columns and create a new column with the following rules:

How do I do this?

Upvotes: 0

Views: 101

Answers (1)

Benny Elgazar

Reputation: 361

Use a case/when expression.

You have two options for this:

  1. Using SparkSQL
  2. DataFrame Manipulation. (Ref: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.when.html?highlight=when#pyspark.sql.functions.when)

Let's do it using the 2nd way:

import pyspark.sql.functions as F

(
    DF
    .withColumn(
        'rule_result',
        # Each comparison must be wrapped in parentheses: in Python, & and |
        # bind tighter than ==, so F.col("a") == 'x' & F.col("b") == 'y'
        # raises an error. The string literals match the capitalization in
        # the sample data ('Online'/'Offline').
        F.when((F.col("type_person") == 'Online') & (F.col("type_person2") == 'Online'), 'online')
        # The question's rules were omitted; this branch assumes the
        # Offline/Offline case maps to 'offline' (the original answer
        # duplicated the Offline/Online condition here).
        .when((F.col("type_person") == 'Offline') & (F.col("type_person2") == 'Offline'), 'offline')
        .when((F.col("type_person") == 'Offline') & (F.col("type_person2") == 'Online'), 'hybrid')
        .when((F.col("type_person") == 'Online') & (F.col("type_person2") == 'Offline'), 'hybrid')
        .when((F.col("type_person") == 'Hybrid') | (F.col("type_person2") == 'Hybrid'), 'hybrid')
        .otherwise(None)
    )
)

Upvotes: 1
