Reputation: 119
I have the dataframe below:
customer_id | person_id | type_person | type_person2 | insert_date2 | anterior_type | update_date |
---|---|---|---|---|---|---|
abcdefghijklmnopqrst | 4a5ae8a5-6682-467 | Online | Online | 2022-03-02 | null | null |
abcdefghijklmnopqrst | 1be8d3e8-8075-438 | Online | Online | 2022-03-02 | null | null |
abcdefghijklmnopqrst | 6912dadc-1692-4bd | Online | Offline | 2022-03-02 | Online | 2022-03-03 |
abcdefghijklmnopqrst | e48cba37-113c-4bd | Online | Online | 2022-03-02 | null | null |
abcdefghijklmnopqrst | 831cb669-b2ae-4e8 | Online | Online | 2022-03-02 | null | null |
abcdefghijklmnopqrst | 69161fe5-62ac-400 | Online | Online | 2022-03-02 | null | null |
abcdefghijklmnopqrst | b48b59a0-92eb-410 | Online | Online | 2022-03-02 | null | null |
I need to look at the `type_person` and `type_person2` columns and create a new column with the following rules:
How do I do this?
Upvotes: 0
Views: 101
Reputation: 361
Use a `when`/`otherwise` conditional expression.
You have two options to do so: a SQL-style `CASE WHEN` passed to `F.expr`, or chained calls to the `when` function:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.when.html?highlight=when#pyspark.sql.functions.when
Let's do it using the 2nd way:
import pyspark.sql.functions as F

(
    DF
    .withColumn(
        'rule_result',
        # Each comparison must be wrapped in parentheses: & and | bind more
        # tightly than == in PySpark column expressions. The string literals
        # are capitalized to match the values in the sample data.
        F.when((F.col('type_person') == 'Online') & (F.col('type_person2') == 'Online'), 'Online')
         .when((F.col('type_person') == 'Offline') & (F.col('type_person2') == 'Offline'), 'Offline')
         .when((F.col('type_person') == 'Offline') & (F.col('type_person2') == 'Online'), 'Hybrid')
         .when((F.col('type_person') == 'Online') & (F.col('type_person2') == 'Offline'), 'Hybrid')
         .when((F.col('type_person') == 'Hybrid') | (F.col('type_person2') == 'Hybrid'), 'Hybrid')
         .otherwise(None)
    )
)
Upvotes: 1