Reputation: 273
I have the dataframe below:
id | feature1 | feature2 | feature3 | feature4 |
---|---|---|---|---|
1 | a | 0 | 0 | 0 |
1 | b | 0 | 0 | 0 |
1 | c | 0 | 0 | 0 |
2 | a | 1 | 0 | 0 |
2 | b | 0 | 0 | 0 |
2 | c | 0 | 0 | 0 |
3 | b | 0 | 0 | 0 |
3 | c | 0 | 1 | 0 |
I need to add a new column to get the dataframe below:
id | feature1 | feature2 | feature3 | feature4 | NewColumn |
---|---|---|---|---|---|
1 | a | 0 | 0 | 0 | 1 |
1 | b | 0 | 0 | 0 | 1 |
1 | c | 0 | 0 | 0 | 1 |
2 | a | 1 | 0 | 0 | 0 |
2 | b | 0 | 0 | 0 | 0 |
2 | c | 0 | 0 | 0 | 0 |
3 | b | 0 | 0 | 0 | 0 |
3 | c | 0 | 1 | 0 | 0 |
For the case when id=1, NewColumn should be 1 because feature2, feature3, and feature4 are 0 in all of its rows. However, for id=2, NewColumn should be 0 because feature2 is 1 where feature1=a. Likewise for id=3, NewColumn should be 0 because feature3 contains a 1. Basically, I need to consider each id individually and check whether its feature2, feature3, and feature4 columns contain any 1. How can I implement this efficiently in PySpark?
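For reproducibility, here is a minimal sketch that builds the sample dataframe above (it assumes an active SparkSession; the data and column names are taken from the table in the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [
    (1, 'a', 0, 0, 0), (1, 'b', 0, 0, 0), (1, 'c', 0, 0, 0),
    (2, 'a', 1, 0, 0), (2, 'b', 0, 0, 0), (2, 'c', 0, 0, 0),
    (3, 'b', 0, 0, 0), (3, 'c', 0, 1, 0),
]
df = spark.createDataFrame(data, ['id', 'feature1', 'feature2', 'feature3', 'feature4'])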
Upvotes: 1
Views: 1172
Reputation: 2939
Since this needs to check values across rows, you can take the collect_list of an array of feature2, feature3, and feature4 over a window partitioned by id, flatten the resulting array of arrays, check whether it contains the value 1 using the higher-order function exists, and then cast the negated boolean to integer. (Requires Spark version >= 2.4.)
from pyspark.sql.functions import expr
# Collect one [feature2, feature3, feature4] array per row across all rows
# sharing the same id, flatten into a single array, test whether any element
# equals 1, negate, and cast the boolean to 0/1.
df = df.withColumn('NewColumn',
    expr("""cast(
        not exists(
            flatten(
                collect_list(array(feature2, feature3, feature4)) over (partition by id)),
            v -> v = 1)
        as int)"""))
df.show()
# +---+--------+--------+--------+--------+---------+
# | id|feature1|feature2|feature3|feature4|NewColumn|
# +---+--------+--------+--------+--------+---------+
# | 1| a| 0| 0| 0| 1|
# | 1| b| 0| 0| 0| 1|
# | 1| c| 0| 0| 0| 1|
# | 3| b| 0| 0| 0| 0|
# | 3| c| 0| 1| 0| 0|
# | 2| a| 1| 0| 0| 0|
# | 2| b| 0| 0| 0| 0|
# | 2| c| 0| 0| 0| 0|
# +---+--------+--------+--------+--------+---------+
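As a design note, collect_list over an unbounded partition materializes every row's array per id. A cheaper equivalent sketch using the DataFrame API (an assumed alternative, not the method above; it avoids higher-order functions, so it also works on Spark < 2.4, assuming the feature columns are numeric):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per id, take the max of the row-wise greatest feature value; it is 0 only
# when every feature2/feature3/feature4 value in the group is 0.
w = Window.partitionBy('id')
df2 = df.withColumn(
    'NewColumn',
    (F.max(F.greatest('feature2', 'feature3', 'feature4')).over(w) == 0).cast('int'))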
Upvotes: 1