scoder

Reputation: 2611

Spark merge rows based on some condition and retain the values

I have a DataFrame like the one below:

+--------------------+-------------+----------+--------------------+------------------+
|       cond         | val         |  val1    |          val2      |         val3     |
+--------------------+-------------+----------+--------------------+------------------+
|cond1               | 1           | null     | null               |        null      |
|cond1               | null        | 2        | null               |        null      |
|cond1               | null        | null     | 3                  |        null      |
|cond1               | null        | null     | null               |        4         |
|cond2               | null        | null     | null               |        44        |
|cond2               | null        | 22       | null               |        null      |
|cond2               | null        | null     | 33                 |        null      |
|cond2               | 11          | null     | null               |        null      |
|cond3               | null        | null     | null               |        444       |
|cond3               | 111         | 222      | null               |        null      |
|cond3               | 1111        | null     | null               |        null      |
|cond3               | null        | null     | 333                |        null      |
+--------------------+-------------+----------+--------------------+------------------+

I want to collapse all rows that share the same value in the cond column into a single row, keeping the non-null values from every other column. The result should look like this:

+--------------------+-------------+----------+--------------------+------------------+
|       cond         | val         |  val1    |          val2      |         val3     |
+--------------------+-------------+----------+--------------------+------------------+
|cond1               | 1           | 2        | 3                  |        4         |
|cond2               | 11          | 22       | 33                 |        44        |
|cond3               | 111,1111    | 222      | 333                |        444       |
+--------------------+-------------+----------+--------------------+------------------+

Upvotes: 0

Views: 1259

Answers (1)

Danny Varod

Reputation: 18118

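Try using .groupBy() with a single .agg() call that lists every aggregated column. Chaining several .agg() calls does not work: the first one already returns a plain DataFrame, so the later ones would aggregate over the whole result, where val1, val2 and val3 no longer exist. E.g.:

import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// collect_list skips nulls; concat_ws joins what remains, e.g. "111,1111"
// (merge is just a local helper name; the cast lets concat_ws accept numeric columns)
def merge(c: String) = concat_ws(",", collect_list(col(c).cast("string"))).name(c)

val output = input.groupBy("cond")
  .agg(merge("val"), merge("val1"), merge("val2"), merge("val3"))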
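If you want to check this locally, here is a minimal self-contained sketch. It assumes Spark SQL is on the classpath and runs with a local master; the object name MergeRows and the helper mergeCol are made up for illustration, and only the cond3 rows from the question are reproduced. All aggregates go into one .agg() call.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

object MergeRows {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, just for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-rows")
      .getOrCreate()
    import spark.implicits._

    // The cond3 rows from the question; Option models the null cells.
    val input = Seq[(String, Option[Int], Option[Int], Option[Int], Option[Int])](
      ("cond3", None, None, None, Some(444)),
      ("cond3", Some(111), Some(222), None, None),
      ("cond3", Some(1111), None, None, None),
      ("cond3", None, None, Some(333), None)
    ).toDF("cond", "val", "val1", "val2", "val3")

    // Hypothetical helper: gather the non-null values of one column per group
    // (collect_list ignores nulls) and join them with commas.
    def mergeCol(name: String) =
      concat_ws(",", collect_list(col(name).cast("string"))).name(name)

    val output = input
      .groupBy("cond")
      .agg(mergeCol("val"), mergeCol("val1"), mergeCol("val2"), mergeCol("val3"))

    output.show(false)
    spark.stop()
  }
}
```

Note that collect_list makes no promise about ordering within a group, so the two val entries for cond3 may come out in either order. If the order matters, you would need to sort the collected array or carry an explicit ordering column.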

Upvotes: 1
