Reputation: 779
Suppose I have the following Spark DataFrame:
COLA COLB COLC
2019 A 10
2019 A 20
2019 B 30
2019 B 40
Is there any function to generate the following DataFrame?
COLA COLB COLC
2019 A 10
2019 A 20
SUB_TOTAL 30
2019 B 30
2019 B 40
SUB_TOTAL 70
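For reference, the input can be reproduced with a small PySpark snippet like the one below (a sketch assuming an active SparkSession bound to the name spark; the variable name df is illustrative):

from pyspark.sql import SparkSession

# assumes an active session; in a PySpark shell, `spark` already exists
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(2019, "A", 10), (2019, "A", 20), (2019, "B", 30), (2019, "B", 40)],
    ["COLA", "COLB", "COLC"])
df.show()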
Upvotes: 0
Views: 242
Reputation: 327
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; already in scope in spark-shell

val tab = Seq((2019, "A", 10),
  (2019, "A", 20),
  (2019, "B", 30),
  (2019, "B", 40)).toDF("COLA", "COLB", "COLC")
tab.show

// sub-total of COLC per (COLA, COLB) group
val inter = tab.groupBy(col("COLA"), col("COLB")).agg(sum("COLC").alias("COLC"))

// `final` is a reserved word in Scala, so use a different name
val result = tab.union(inter).orderBy(col("COLC"))
result.show
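Note that union resolves columns by position, so inter must keep the same column order as tab. Also, ordering by COLC alone interleaves a sub-total with any detail row that shares its value (here the group sum 30 ties with the detail row 30), and the sub-total rows keep their year in COLA rather than a SUB_TOTAL label, so they are not visually distinguishable in the result.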
Upvotes: 0
Reputation: 32660
You haven't really specified which columns you'd like to group by, as the year is the same in all rows. But you can use a simple groupBy with union to achieve that:
from pyspark.sql.functions import col, lit, sum

# calculate sub-totals grouped by COLB
sub_total_by_colb = df.groupBy(col("COLB")).agg(sum("COLC").alias("COLC")) \
    .select(lit(None).alias("COLA"), col("COLB"), col("COLC"))

# union the original df with the sub-total df
df.union(sub_total_by_colb) \
    .sort(col("COLB").asc_nulls_last(), col("COLA").asc_nulls_last()) \
    .show()
Output:
+----+----+----+
|COLA|COLB|COLC|
+----+----+----+
|2019| A| 10|
|2019| A| 20|
|null| A|30.0|
|2019| B| 40|
|2019| B| 30|
|null| B|70.0|
+----+----+----+
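If you want the literal SUB_TOTAL label from the question instead of null, one option is to cast COLA to string first, since a text label cannot live in a numeric column. A minimal sketch, assuming the same df as above:

from pyspark.sql.functions import col, lit, sum

# label the sub-total rows explicitly instead of using null
labelled = df.groupBy(col("COLB")).agg(sum("COLC").alias("COLC")) \
    .select(lit("SUB_TOTAL").alias("COLA"), col("COLB"), col("COLC"))

# cast COLA to string so the label and the years share one type;
# "2019" sorts before "SUB_TOTAL", keeping each sub-total after its group
df.withColumn("COLA", col("COLA").cast("string")) \
    .union(labelled) \
    .sort(col("COLB"), col("COLA")) \
    .show()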
That said, I would suggest you have a look at multi-dimensional aggregations, which are better suited when you want to calculate sub-totals. Example using rollup:
from pyspark.sql.functions import col, grouping_id, sum

# gid == 3 marks the grand-total row, so filter it out via the alias
df.rollup(col("COLA"), col("COLB")) \
    .agg(sum("COLC").alias("COLC"), grouping_id().alias("gid")) \
    .sort(col("COLA").desc_nulls_last(), col("COLB").asc_nulls_last()) \
    .filter(col("gid") != 3) \
    .select(col("COLA"), col("COLB"), col("COLC")) \
    .show()
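For context: rollup(COLA, COLB) aggregates over the grouping sets (COLA, COLB), (COLA) and (), and grouping_id() encodes which set a row came from (0 for the full (COLA, COLB) group, 1 for the (COLA) sub-total, 3 for the grand total), so dropping gid 3 removes the grand-total row. Note that unlike the union approach, the finest level here is already an aggregate: the individual detail rows (10, 20, ...) are collapsed into their group sums, so if you need to keep them you would still union the raw rows back in.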
Upvotes: 1