Reputation: 5480
I have a data frame like below in PySpark:
+-----+---+---+----+
|tests|val|asd|cnty|
+-----+---+---+----+
|test1| Y| 1|null|
|test2| N| 2| UK|
| null| Y| 1| UK|
|test1| N| 2|null|
| null| N| 3|null|
|test3| N| 4| AUS|
|test4| Y| 5|null|
+-----+---+---+----+
I want to update the val column: if any row for a given tests value has val Y, then the val of every row for that tests should be updated to Y; otherwise the rows keep whatever values they already have. Records where the tests column is null should be excluded from this update and keep their original val.
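For reference, a minimal sketch to reproduce this DataFrame (assuming an active SparkSession named spark):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ('test1', 'Y', 1, None),
    ('test2', 'N', 2, 'UK'),
    (None, 'Y', 1, 'UK'),
    ('test1', 'N', 2, None),
    (None, 'N', 3, None),
    ('test3', 'N', 4, 'AUS'),
    ('test4', 'Y', 5, None),
]
df = spark.createDataFrame(data, ['tests', 'val', 'asd', 'cnty'])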
I have tried the following:
from pyspark.sql import Window
import pyspark.sql.functions as f

# replace val with the max val within each tests partition
df1 = df.select(
    'tests',
    f.max('val').over(Window.partitionBy('tests')).alias('val'),
    'asd', 'cnty'
)
I am getting the result below. Note that both rows with a null tests value were updated to Y, because all null tests fall into the same window partition:
+-----+---+---+----+
|tests|val|asd|cnty|
+-----+---+---+----+
|test1| Y| 1|null|
|test1| Y| 2|null|
|test2| N| 2| UK|
|test3| N| 4| AUS|
|test4| Y| 5|null|
| null| Y| 1| UK|
| null| Y| 3|null|
+-----+---+---+----+
However, I want the result to look like below, with the null-tests rows left unchanged:
+-----+---+---+----+
|tests|val|asd|cnty|
+-----+---+---+----+
|test1| Y| 1|null|
|test1| Y| 2|null|
|test2| N| 2| UK|
|test3| N| 4| AUS|
|test4| Y| 5|null|
| null| Y| 1| UK|
| null| N| 3|null|
+-----+---+---+----+
Upvotes: 1
Views: 5179
Reputation: 41957
All you need is an additional when condition that checks whether tests is null (or holds the literal string 'null'):
from pyspark.sql import Window
import pyspark.sql.functions as f

# keep the original val where tests is null; otherwise take the partition max
df1 = df.select(
    'tests',
    f.when(f.isnull(f.col('tests')) | (f.col('tests') == 'null'), f.col('val'))
     .otherwise(f.max('val').over(Window.partitionBy('tests')))
     .alias('val'),
    'asd', 'cnty'
)
which would give you
+-----+---+---+----+
|tests|val|asd|cnty|
+-----+---+---+----+
|test4| Y| 5|null|
|test3| N| 4| AUS|
|test1| Y| 1|null|
|test1| Y| 2|null|
|test2| N| 2| UK|
| null| Y| 1| UK|
| null| N| 3|null|
+-----+---+---+----+
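As a side note, f.max works as the "any Y" test here because 'Y' sorts after 'N' in string ordering. Also, the f.col('tests') == 'null' comparison is only needed if the column can contain the literal string 'null' (e.g. from a raw CSV load); if tests only ever holds real nulls, a minimal sketch under that assumption would be:
from pyspark.sql import Window
import pyspark.sql.functions as f

# assumes tests contains real nulls only, never the string 'null'
df1 = df.select(
    'tests',
    f.when(f.col('tests').isNull(), f.col('val'))
     .otherwise(f.max('val').over(Window.partitionBy('tests')))
     .alias('val'),
    'asd', 'cnty'
)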
I hope this answer is helpful.
Upvotes: 1