Reputation: 37
I am trying to update an existing column in PySpark, but the old values in the column are being overwritten as well, even though there is no otherwise condition.
+-----+-----+-----+-----+-----+----+
|cntry|cde_1|cde_2|rsn_1|rsn_2|FLAG|
+-----+-----+-----+-----+-----+----+
|   MY|    A|     |    1|    2|null|
|   MY|    G|     |    1|    2|null|
|   MY|     |    G|    1|    2|null|
|   TH|    A|     |   16|    2|null|
|   TH|    B|     |    1|   16|   1|
|   TH|     |    W|   16|    2|   1|
+-----+-----+-----+-----+-----+----+
from pyspark.sql import functions as F

df = sc.parallelize([
    ["MY", "A", "", "1", "2"],
    ["MY", "G", "", "1", "2"],
    ["MY", "", "G", "1", "2"],
    ["TH", "A", "", "16", "2"],
    ["TH", "B", "", "1", "16"],
    ["TH", "", "W", "16", "2"]
]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))

df = df.withColumn("FLAG", F.when((df.cntry == "MY") & (df.cde_1.isin("G") | df.cde_2.isin("G")) & ((df.rsn_1 == "1") | (df.rsn_2 == "1")), 1))
df = df.withColumn("FLAG", F.when((df.cntry == "TH") & (df.cde_1.isin("B", "W") | df.cde_2.isin("B", "W")) & ((df.rsn_1 == "16") | (df.rsn_2 == "16")), 1))
Upvotes: 1
Views: 58
Reputation: 13459
You need to combine your conditions with a Boolean OR, like this:
from pyspark.sql import functions as F

df = sc.parallelize([
    ["MY", "A", "", "1", "2"],
    ["MY", "G", "", "1", "2"],
    ["MY", "", "G", "1", "2"],
    ["TH", "A", "", "16", "2"],
    ["TH", "B", "", "1", "16"],
    ["TH", "", "W", "16", "2"]
]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))

cond1 = (df.cntry == "MY") & (df.cde_1.isin("G") | df.cde_2.isin("G")) & ((df.rsn_1 == "1") | (df.rsn_2 == "1"))
cond2 = (df.cntry == "TH") & (df.cde_1.isin("B", "W") | df.cde_2.isin("B", "W")) & ((df.rsn_1 == "16") | (df.rsn_2 == "16"))

df.withColumn("FLAG", F.when(cond1 | cond2, 1)).show()
In your last line, you overwrite the FLAG column, because you never reference its previous state. That's why the values set by the first withColumn are not preserved.
Instead of combining the expressions, you could also use when(cond1, 1).otherwise(when(cond2, 1)). That's a stylistic choice.
Upvotes: 1