spoonman
spoonman

Reputation: 37

update an existing column in pyspark without changing older values

I am trying to update an existing column in pyspark but seems like old values in the column are getting updated as well despite no otherwise condition

+-----+-----+-----+-----+-----+----+
|cntry|cde_1|cde_2|rsn_1|rsn_2|FLAG|
+-----+-----+-----+-----+-----+----+
|   MY|    A|     |    1|    2|null|
|   MY|    G|     |    1|    2|null|
|   MY|     |    G|    1|    2|null|
|   TH|    A|     |   16|    2|null|
|   TH|    B|     |    1|   16|   1|
|   TH|     |    W|   16|    2|   1|
+-----+-----+-----+-----+-----+----+
df = sc.parallelize([ ["MY","A","","1","2"], ["MY","G","","1","2"], ["MY","","G","1","2"],  ["TH","A","","16","2"], ["TH","B","","1","16"], ["TH","","W","16","2"] ]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))


df = df.withColumn('FLAG', F.when( (df.cntry == "MY") &  ( (df.cde_1.isin("G") ) |  (df.cde_2.isin("G") ) )   &  ( (df.rsn_1 == "1") | (df.rsn_2 == "1") ) , 1))

df = df.withColumn('FLAG', F.when( (df.cntry == "TH") &  ( (df.cde_1.isin("B", "W") ) |  (df.cde_2.isin("B", "W") ) )  & ( (df.rsn_1 == "16") |  (df.rsn_2 == "16") ) , 1))

Upvotes: 1

Views: 58

Answers (1)

Oliver W.
Oliver W.

Reputation: 13459

You need to combine your conditions using the Boolean OR. Like this:

df = sc.parallelize([ ["MY","A","","1","2"], ["MY","G","","1","2"], ["MY","","G","1","2"], ["TH","A","","16","2"], ["TH","B","","1","16"], ["TH","","W","16","2"] ]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))
cond1 = (df.cntry == "MY") & ( (df.cde_1.isin("G") ) | (df.cde_2.isin("G") ) ) & ( (df.rsn_1 == "1") | (df.rsn_2 == "1") )
cond2 = (df.cntry == "TH") & ( (df.cde_1.isin("B", "W") ) | (df.cde_2.isin("B", "W") ) ) & ( (df.rsn_1 == "16") | (df.rsn_2 == "16") )
df.withColumn("FLAG", F.when(cond1 | cond2, 1)).show()

In your last line, you overwrite the FLAG column, as you’re not referencing its previous state. That’s why the previous values are not preserved.

Instead of combining the expressions, you could also use when(cond1, 1).otherwise(when(cond2, 1)). That’s a stylistic choice.

Upvotes: 1

Related Questions