Reputation: 673
I have a PySpark dataframe with multiple columns (around 320)
I have to find keyword baz
in col A
. in case the baz
is found, then replace the existing value in all columns listed in the list columns_for_replacement
with None
columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]
I am trying to modify the below code to do the same:
for i in columns_for_replacement:
df = df.withColumn(i,when((col(i)=='baz'),None).otherwise(col(i)))
The above code works only for the specific column which is not my expected requirement.
Base dataframe:
A B C D E F G H I J
baz abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
def abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
map abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
noo abc abc abc abc abc abc abc abc abc
Expected dataframe:
A B C D E F G H I J
baz abc
baz abc
def abc abc abc abc abc abc abc abc abc
baz abc
map abc abc abc abc abc abc abc abc abc
baz abc
noo abc abc abc abc abc abc abc abc abc
Upvotes: 1
Views: 2463
Reputation: 32720
You need to check column A
:
from pyspark.sql import functions as F
df1 = df.select(
col("A"),
*[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
col("J")
)
df1.show()
#+---+----+----+----+----+----+----+----+----+---+
#| A| B| C| D| E| F| G| H| I| J|
#+---+----+----+----+----+----+----+----+----+---+
#|baz|null|null|null|null|null|null|null|null|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|def| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|map| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#|baz|null|null|null|null|null|null|null|null|abc|
#|noo| abc| abc| abc| abc| abc| abc| abc| abc|abc|
#+---+----+----+----+----+----+----+----+----+---+
Or more generic to avoid select each column other than in the list to replace:
df1 = df.select(
*[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
*[F.col(c) for c in df.columns if c not in columns_for_replacement]
).toDF(*df.columns)
Upvotes: 1
Reputation: 42422
I think you meant to use column A
in the when condition, not column i
:
columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]
for i in columns_for_replacement:
df = df.withColumn(i, when((col("A")=='baz'), lit(None)).otherwise(col(i)))
Another way to do this would be to do a select
:
columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]
df2 = df.select([
when((col("A")=='baz'), lit(None)).otherwise(col(c)).alias(c)
if c in columns_for_replacement
else col(c)
for c in df.columns
])
Upvotes: 2