formicaman

Reputation: 1357

Removing nulls from Pyspark Dataframe in individual columns

I have a pyspark dataframe like this:

    +--------------------+--------------------+
    |                name|               value|
    +--------------------+--------------------+
    |                null|                null|
    |                null|                null|
    |                null|                null|
    |                null|                null|
    |                null|                null|
    |                null|                null|
    |                null|                null|
    |                null|                null|
    |                null|                null|
    |                null|                null|
    |                  id|                null|
    |                name|                null|
    |                 age|                null|
    |                food|                null|
    |                null|                   1|
    |                null|                 Joe|
    |                null|                  47|
    |                null|               pizza|
    +--------------------+--------------------+

I want to remove the null values from each individual column so that the non-null data lines up.

The desired output is:

    +--------------------+--------------------+
    |                name|               value|
    +--------------------+--------------------+
    |                  id|                   1|
    |                name|                 Joe|
    |                 age|                  47|
    |                food|               pizza|
    +--------------------+--------------------+

I have tried removing nulls with something like df.dropna(how='any') or df.dropna(how='all'), and also by separating out the columns and dropping the nulls from each, but then it becomes difficult to join them back together (roughly as in the sketch below).
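Roughly, that separate-and-rejoin attempt looks like this (rn is a synthetic row-number key built over monotonically_increasing_id, added just to line the rows back up, and it is exactly what makes the rejoin fragile):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # drop nulls from each column independently, then number the surviving
    # rows so they can be matched up positionally again
    w = Window.orderBy(F.monotonically_increasing_id())
    names = df.select("name").dropna().withColumn("rn", F.row_number().over(w))
    values = df.select("value").dropna().withColumn("rn", F.row_number().over(w))

    # rejoin on the synthetic key and discard it
    rejoined = names.join(values, "rn").drop("rn")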

Upvotes: 1

Views: 204

Answers (1)

Som

Reputation: 6338

Try this. It is written in Scala, but it can be ported to PySpark with minimal change (a rough PySpark port is sketched after the output below).

    import org.apache.spark.sql.functions._
    import spark.implicits._ // for the $"..." column syntax

    // collect_list drops nulls, so each column collapses to its non-null values;
    // map_from_arrays zips them positionally and explode_outer re-expands to rows
    df.select(map_from_arrays(collect_list("name"), collect_list("value")).as("map"))
      .select(explode_outer($"map").as(Seq("name", "value")))
      .show(false)

    /**
      * +----+-----+
      * |name|value|
      * +----+-----+
      * |id  |1    |
      * |name|Joe  |
      * |age |47   |
      * |food|pizza|
      * +----+-----+
      */
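Since the question is in PySpark, here is a rough port of the same idea (a sketch, assuming Spark 2.4+ for map_from_arrays; note that Spark does not guarantee collect_list preserves row order across partitions, so on shuffled data the pairing can get scrambled):

    from pyspark.sql import functions as F

    # collect the non-null values of each column into arrays, zip them
    # into a map, then explode the map back into one row per entry
    result = (df
        .select(F.map_from_arrays(F.collect_list("name"),
                                  F.collect_list("value")).alias("map"))
        .select(F.explode_outer("map").alias("name", "value")))

    result.show(truncate=False)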

Upvotes: 1
