Reputation: 213
I wonder if there is an easy way to combine multiple rows into one in PySpark. I am new to Python and Spark and have been using Spark SQL most of the time.
Here is a data example:
id  count1  count2  count3
1   null    1       null
1   3       null    null
1   null    null    5
2   null    1       null
2   1       null    null
2   null    null    2
the expected output is :
id  count1  count2  count3
1   3       1       5
2   1       1       2
I have been using Spark SQL to join them multiple times, and wonder if there is an easier way to do that.
Thank you!
Upvotes: 2
Views: 453
Reputation: 2049
Spark SQL's sum ignores null values (effectively treating them as zero), so if you know there are no "overlapping" data elements, just group by the column you want to aggregate by and sum. Assuming that you want to keep your original column names (and not sum the id column), you'll need to specify the columns to sum and then rename them after the aggregation.
before.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
| 1| null| 1| null|
| 1| 3| null| null|
| 1| null| null| 5|
| 2| null| 1| null|
| 2| 1| null| null|
| 2| null| null| 2|
+---+------+------+------+
from pyspark.sql.functions import col

# group by id, sum the count columns, then restore the original column names
after = (before
    .groupby('id').sum(*[c for c in before.columns if c != 'id'])
    .select('id', *[col(f"sum({c})").alias(c) for c in before.columns if c != 'id']))
after.show()
+---+------+------+------+
| id|count1|count2|count3|
+---+------+------+------+
|  1|     3|     1|     5|
|  2|     1|     1|     2|
+---+------+------+------+
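If you want to skip the rename step, the sum and alias can be collapsed into a single agg call. This is a minimal sketch assuming the same before DataFrame as above:
from pyspark.sql import functions as F

# sum each non-id column and restore its original name in one step
after = before.groupBy('id').agg(
    *[F.sum(c).alias(c) for c in before.columns if c != 'id'])
And since the question mentions mostly using Spark SQL, the equivalent query is a plain GROUP BY with one sum per column (a sketch assuming a hypothetical temp view named t):
before.createOrReplaceTempView('t')  # 't' is a hypothetical view name
after = spark.sql("""
    SELECT id,
           sum(count1) AS count1,
           sum(count2) AS count2,
           sum(count3) AS count3
    FROM t
    GROUP BY id
""")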
Upvotes: 3