MichiganMagician
MichiganMagician

Reputation: 273

Pyspark. Create A new column based on multiple columns and multiple rows

I have a dataframe below:

id feature1 feature2 feature3 feature4
1 a 0 0 0
1 b 0 0 0
1 c 0 0 0
2 a 1 0 0
2 b 0 0 0
2 c 0 0 0
3 b 0 0 0
3 c 0 1 0

I need to get a new column and get the dataframe below:

id feature1 feature2 feature3 feature4 NewColumn
1 a 0 0 0 1
1 b 0 0 0 1
1 c 0 0 0 1
2 a 1 0 0 0
2 b 0 0 0 0
2 c 0 0 0 0
3 b 0 0 0 0
3 c 0 1 0 0

For the case when id=1, newColumn should be 1 because all rows with feature2 , feature3 and feature4 are 0. However, for the case when id==2 NewColumn should be 0 because feature2 is 1 when feature1=a in this case. Same when id==3, NewColumn should be 0 because feature3 contains 1. Basically, I need to consider all ids individually and check feature2, feature3 and feature4 columns if they contain 1. How to implement it efficiently in Pyspark?

Upvotes: 1

Views: 1172

Answers (1)

AdibP
AdibP

Reputation: 2939

Since it needs to check values across rows, you can use collect_list of array of feature2, feature3 and feature4 over window partitioned by column id, flatten the array, check if the array contains any value of 1 using exists, then cast the negated boolean value to integer. (for Spark version >= 2.4)

from pyspark.sql.functions import expr

df = df.withColumn('NewColumn', 
                   expr("""cast(
                             not exists(
                               flatten(
                                 collect_list(array(feature2,feature3,feature4)) over (partition by id)),
                               v -> v = 1)
                             as int)"""))
df.show()

# +---+--------+--------+--------+--------+---------+
# | id|feature1|feature2|feature3|feature4|NewColumn|
# +---+--------+--------+--------+--------+---------+
# |  1|       a|       0|       0|       0|        1|
# |  1|       b|       0|       0|       0|        1|
# |  1|       c|       0|       0|       0|        1|
# |  3|       b|       0|       0|       0|        0|
# |  3|       c|       0|       1|       0|        0|
# |  2|       a|       1|       0|       0|        0|
# |  2|       b|       0|       0|       0|        0|
# |  2|       c|       0|       0|       0|        0|
# +---+--------+--------+--------+--------+---------+

Upvotes: 1

Related Questions