Reputation: 11
I'm new to PySpark and trying to figure out how to achieve the result below.
I have a DataFrame with several columns. I want to check the columns id1, id2 and id3 in order, and as soon as the first non-null value is found, add it to a new column. Once a value is found for a record, the remaining columns don't need to be checked.
The dataframe:

| name  | id1     | hobby   | id2     | gender | id3     | language |
|-------|---------|---------|---------|--------|---------|----------|
| Mike  | AAA-BBB | Fishing |         | M      | AAA-BBB | Eng      |
| Louis |         |         | ABC-DDD | M      |         |          |
| Peter | DSA-SDF | Hunting | DSA-SDF | M      | DSA-SDF | Eng      |
The desired dataframe:

| name  | id1     | hobby   | id2     | gender | id3     | language | id      |
|-------|---------|---------|---------|--------|---------|----------|---------|
| Mike  | AAA-BBB | Fishing |         | M      | AAA-BBB | Eng      | AAA-BBB |
| Louis |         |         | ABC-DDD | M      |         |          | ABC-DDD |
| Peter | DSA-SDF | Hunting | DSA-SDF | M      | DSA-SDF | Eng      | DSA-SDF |
Any help would be greatly appreciated.
Upvotes: 1
Views: 55
Reputation: 1885
You can do that with a chain of `when` conditions: each `when` is evaluated in order, so the first non-null column wins.
from pyspark.sql import functions
df = df.withColumn(
    "id",
    functions.when(df["id1"].isNotNull(), df["id1"])
    .when(df["id2"].isNotNull(), df["id2"])
    .when(df["id3"].isNotNull(), df["id3"]),
)
df.show()
+-----+-------+-------+-------+------+-------+--------+-------+
| name|    id1|  hobby|    id2|gender|    id3|language|     id|
+-----+-------+-------+-------+------+-------+--------+-------+
| Mike|AAA-BBB|Fishing|   null|     M|AAA-BBB|     Eng|AAA-BBB|
|Louis|   null|   null|ABC-DDD|     M|   null|    null|ABC-DDD|
|Peter|DSA-SDF|Hunting|DSA-SDF|     M|DSA-SDF|     Eng|DSA-SDF|
+-----+-------+-------+-------+------+-------+--------+-------+
Upvotes: 0