Reputation: 13
I have a Spark Dataframe as below
ID | Col A | Col B |
---|---|---|
1 | null | Some Value |
2 | Some Value | null |
I need to add a new column which contains the column name (among Col A and Col B) which is not null. So the expected dataframe should look like,
ID | Col A | Col B | result |
---|---|---|---|
1 | null | Some Value | Col B |
2 | Some Value | null | Col A |
Any help would be much appreciated.
Thank you!
Upvotes: 0
Views: 899
Reputation: 10035
after creating temp views from your dataframe eg
df.createOrReplaceTempView("my_data")
you may run the following on your spark session using newdf = sparkSession.sql("query here")
SELECT
ID,
ColA,
ColB,
CASE
WHEN ColA IS NULL AND ColB IS NULL THEN NULL
WHEN ColB IS NULL THEN 'ColA'
WHEN ColA IS NULL THEN 'ColB'
ELSE 'ColA Col B'
END AS result
FROM my_data
or just using python
from pyspark.sql.functions import when, col
df = df.withColumn("result",when(
col("Col A").isNull() & col("Col B").isNull() , None
).when(
col("Col B").isNull() ,'Col A'
).when(
col("Col A").isNull() ,'Col B'
).otherwise('Col A Col B')
)
Upvotes: 1