Reputation: 11
I'm new to PySpark and trying to figure out how to achieve the result below.
I have a DataFrame with several columns. I want to check the columns id1, id2 and id3 in order, and as soon as the first non-null value is found, add it to a new column. Once a value is found for a record, the remaining columns don't need to be checked.
The dataframe:

| name  | id1     | hobby   | id2     | gender | id3     | language |
|-------|---------|---------|---------|--------|---------|----------|
| Mike  | AAA-BBB | Fishing |         | M      | AAA-BBB | Eng      |
| Louis |         |         | ABC-DDD | M      |         |          |
| Peter | DSA-SDF | Hunting | DSA-SDF | M      | DSA-SDF | Eng      |
The desired dataframe:

| name  | id1     | hobby   | id2     | gender | id3     | language | id      |
|-------|---------|---------|---------|--------|---------|----------|---------|
| Mike  | AAA-BBB | Fishing |         | M      | AAA-BBB | Eng      | AAA-BBB |
| Louis |         |         | ABC-DDD | M      |         |          | ABC-DDD |
| Peter | DSA-SDF | Hunting | DSA-SDF | M      | DSA-SDF | Eng      | DSA-SDF |
Any help would be greatly appreciated.
Upvotes: 1
Views: 55
Reputation: 1885
You can do that with a chain of `when` conditions: each `when` is evaluated in order, so the first non-null column wins.
from pyspark.sql import functions
df = df.withColumn(
    "id",
    functions.when(df["id1"].isNotNull(), df["id1"])
    .when(df["id2"].isNotNull(), df["id2"])
    .when(df["id3"].isNotNull(), df["id3"]),
)
df.show()
+-----+-------+-------+-------+------+-------+--------+-------+
| name|    id1|  hobby|    id2|gender|    id3|language|     id|
+-----+-------+-------+-------+------+-------+--------+-------+
| Mike|AAA-BBB|Fishing|   null|     M|AAA-BBB|     Eng|AAA-BBB|
|Louis|   null|   null|ABC-DDD|     M|   null|    null|ABC-DDD|
|Peter|DSA-SDF|Hunting|DSA-SDF|     M|DSA-SDF|     Eng|DSA-SDF|
+-----+-------+-------+-------+------+-------+--------+-------+
Upvotes: 0