Reputation: 1357
I have a pyspark dataframe like this:
+--------------------+--------------------+
| name| value|
+--------------------+--------------------+
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
| id| null|
| name| null|
| age| null|
| food| null|
| null| 1|
| null| Joe|
| null| 47|
| null| pizza|
+--------------------+--------------------+
I want to remove the null values from each individual column so the non-null data lines up.
The desired output is:
+--------------------+--------------------+
| name| value|
+--------------------+--------------------+
| id| 1|
| name| Joe|
| age| 47|
| food| pizza|
+--------------------+--------------------+
I have tried removing nulls with df.dropna(how='any') and df.dropna(how='all'), and also by separating out the columns and dropping the nulls from each one, but then it becomes difficult to join them back together.
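Roughly, the per-column attempt looks like this (a minimal sketch; df is the frame shown above):
# Drop the nulls from each column separately...
names = df.select("name").dropna()
values = df.select("value").dropna()
# ...but now there is no shared key to join names and values back on,
# so the rows cannot be lined up again.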
Upvotes: 1
Views: 204
Reputation: 6338
Try this. It's written in Scala, but it can be ported to PySpark with minimal changes:
import org.apache.spark.sql.functions._
import spark.implicits._

// collect_list skips nulls, so each column collapses to its non-null values;
// zipping the two arrays into a map pairs them up positionally
df.select(map_from_arrays(collect_list("name"), collect_list("value")).as("map"))
  .select(explode_outer($"map").as(Seq("name", "value")))
  .show(false)
/**
* +----+-----+
* |name|value|
* +----+-----+
* |id |1 |
* |name|Joe |
* |age |47 |
* |food|pizza|
* +----+-----+
*/
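For completeness, a direct PySpark port might look like this (a sketch, assuming Spark 2.4+ so that map_from_arrays and explode_outer are available):
from pyspark.sql import functions as F

result = df.select(
    # collect_list skips nulls, so each column collapses to its non-null values
    F.map_from_arrays(
        F.collect_list("name"), F.collect_list("value")
    ).alias("map")
).select(F.explode_outer("map").alias("name", "value"))

result.show(truncate=False)
One caveat: collect_list gives no ordering guarantee once the data is shuffled across partitions, so for larger data you may want to pair the two columns by an explicit row index instead.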
Upvotes: 1