RSM
RSM

Reputation: 673

Replace values in multiple columns based on value of one column

I have a PySpark dataframe with multiple columns (around 320)

I have to find keyword baz in col A. in case the baz is found, then replace the existing value in all columns listed in the list columns_for_replacement with None

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

I am trying to modify the below code to do the same:

for i in columns_for_replacement:
    df = df.withColumn(i,when((col(i)=='baz'),None).otherwise(col(i)))  

The above code works only for the specific column which is not my expected requirement.

Base dataframe:

A   B   C   D   E   F   G   H   I   J   
baz abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
def abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
map abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
noo abc abc abc abc abc abc abc abc abc

Expected dataframe:

A   B   C   D   E   F   G   H   I   J   
baz                                 abc
baz                                 abc
def abc abc abc abc abc abc abc abc abc
baz                                 abc
map abc abc abc abc abc abc abc abc abc
baz                                 abc
noo abc abc abc abc abc abc abc abc abc

Upvotes: 1

Views: 2463

Answers (2)

blackbishop
blackbishop

Reputation: 32720

You need to check column A :

from pyspark.sql import functions as F

df1 = df.select(
    col("A"),
    *[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
    col("J")
)

df1.show()

#+---+----+----+----+----+----+----+----+----+---+
#|  A|   B|   C|   D|   E|   F|   G|   H|   I|  J|
#+---+----+----+----+----+----+----+----+----+---+
#|baz|null|null|null|null|null|null|null|null|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|def| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|map| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|noo| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#+---+----+----+----+----+----+----+----+----+---+

Or more generic to avoid select each column other than in the list to replace:

df1 = df.select(
    *[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
    *[F.col(c) for c in df.columns if c not in columns_for_replacement]
).toDF(*df.columns)

Upvotes: 1

mck
mck

Reputation: 42422

I think you meant to use column A in the when condition, not column i:

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

for i in columns_for_replacement:
    df = df.withColumn(i, when((col("A")=='baz'), lit(None)).otherwise(col(i)))

Another way to do this would be to do a select:

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

df2 = df.select([
    when((col("A")=='baz'), lit(None)).otherwise(col(c)).alias(c)
    if c in columns_for_replacement 
    else col(c)
    for c in df.columns
])

Upvotes: 2

Related Questions