Ranger J

Reputation: 11

Loop through columns in a dataframe and add a new column to the dataframe with the first non null value found. Using PySpark

I'm new to PySpark and trying to figure out how to achieve the desired result below.

I have a dataframe with several columns. I want to loop through the columns id1, id2 and id3, and as soon as the first non-null value is found for a record, add it to a new column. Once a value has been found for a record, the remaining columns should not be checked.

The dataframe:

+-----+-------+-------+-------+------+-------+--------+
| name|    id1|  hobby|    id2|gender|    id3|language|
+-----+-------+-------+-------+------+-------+--------+
| Mike|AAA-BBB|Fishing|   null|     M|AAA-BBB|     Eng|
|Louis|   null|   null|ABC-DDD|     M|   null|    null|
|Peter|DSA-SDF|Hunting|DSA-SDF|     M|DSA-SDF|     Eng|
+-----+-------+-------+-------+------+-------+--------+

The desired dataframe:

+-----+-------+-------+-------+------+-------+--------+-------+
| name|    id1|  hobby|    id2|gender|    id3|language|     id|
+-----+-------+-------+-------+------+-------+--------+-------+
| Mike|AAA-BBB|Fishing|   null|     M|AAA-BBB|     Eng|AAA-BBB|
|Louis|   null|   null|ABC-DDD|     M|   null|    null|ABC-DDD|
|Peter|DSA-SDF|Hunting|DSA-SDF|     M|DSA-SDF|     Eng|DSA-SDF|
+-----+-------+-------+-------+------+-------+--------+-------+

Any help would be greatly appreciated.

Upvotes: 1

Views: 55

Answers (1)

BoomBoxBoy

Reputation: 1885

You can do that with a chained `when` expression, which takes the first column whose value is not null:

from pyspark.sql import functions

# take id1 if present, otherwise id2, otherwise id3;
# records where all three are null get a null id
df = df.withColumn(
    "id",
    functions.when(df["id1"].isNotNull(), df["id1"])
             .when(df["id2"].isNotNull(), df["id2"])
             .when(df["id3"].isNotNull(), df["id3"]),
)

df.show()
+-----+-------+-------+-------+------+-------+--------+-------+
| name|    id1|  hobby|    id2|gender|    id3|language|     id|
+-----+-------+-------+-------+------+-------+--------+-------+
| Mike|AAA-BBB|Fishing|   null|     M|AAA-BBB|     Eng|AAA-BBB|
|Louis|   null|   null|ABC-DDD|     M|   null|    null|ABC-DDD|
|Peter|DSA-SDF|Hunting|DSA-SDF|     M|DSA-SDF|     Eng|DSA-SDF|
+-----+-------+-------+-------+------+-------+--------+-------+

Upvotes: 0

Related Questions