Reputation: 213
I wonder if there is an easy way to combine multiple rows into one in PySpark. I am new to Python and Spark and have been using Spark SQL most of the time.
Here is a data example:
id  count1  count2  count3
1   null    1       null
1   3       null    null
1   null    null    5
2   null    1       null
2   1       null    null
2   null    null    2
the expected output is :
id  count1  count2  count3
1   3       1       5
2   1       1       2
I have been using Spark SQL to join them multiple times, and wonder if there is an easier way to do that.
Thank you!
Upvotes: 2
Views: 453
Reputation: 2049
Spark SQL's sum ignores null values (effectively treating them as zero), so if you know there are no "overlapping" data elements, just group by the column you want to aggregate by and sum. Assuming that you want to keep your original column names (and not sum the id column), you'll need to specify the columns to sum and then rename them after the aggregation.
before.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
| 1| null| 1| null|
| 1| 3| null| null|
| 1| null| null| 5|
| 2| null| 1| null|
| 2| 1| null| null|
| 2| null| null| 2|
+---+------+------+------+
from pyspark.sql.functions import col

# group by id, sum the count columns, then restore the original column names
after = (before
    .groupby('id').sum(*[c for c in before.columns if c != 'id'])
    .select('id', *[col(f"sum({c})").alias(c) for c in before.columns if c != 'id']))
after.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
|  1|     3|     1|     5|
|  2|     1|     1|     2|
+---+------+------+------+
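If you want to skip the rename step, the sum and alias can be collapsed into a single agg call. This is a minimal sketch assuming the same before DataFrame as above:
from pyspark.sql import functions as F

# sum each non-id column and restore its original name in one step
after = before.groupBy('id').agg(
    *[F.sum(c).alias(c) for c in before.columns if c != 'id'])
And since the question mentions mostly using Spark SQL, the equivalent query is a plain GROUP BY with one sum per column (a sketch assuming a hypothetical temp view named t):
before.createOrReplaceTempView('t')  # 't' is a hypothetical view name
after = spark.sql("""
    SELECT id,
           sum(count1) AS count1,
           sum(count2) AS count2,
           sum(count3) AS count3
    FROM t
    GROUP BY id
""")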
Upvotes: 3