mytabi

Reputation: 779

Spark dataframe query for sum

Suppose I have the following Spark DataFrame:

COLA   COLB  COLC
2019   A     10
2019   A     20
2019   B     30
2019   B     40

Is there a function to generate the following DataFrame?

COLA   COLB       COLC
2019   A          10
2019   A          20
      SUB_TOTAL   30
2019   B          30
2019   B          40
      SUB_TOTAL   70

Upvotes: 0

Views: 242

Answers (2)

ABHISHEK RANA

Reputation: 327

import org.apache.spark.sql.functions._
import spark.implicits._

val tab = Seq((2019, "A", 10),
  (2019, "A", 20),
  (2019, "B", 30),
  (2019, "B", 40)).toDF("COLA", "COLB", "COLC")
tab.show

// sub-total per (COLA, COLB) group
val inter = tab.groupBy(col("COLA"), col("COLB")).agg(sum("COLC").alias("COLC"))

// `final` is a reserved word in Scala, so it cannot be used as a value name
val result = tab.union(inter).orderBy(col("COLC"))
result.show

Upvotes: 0

blackbishop

Reputation: 32660

You haven't really specified how you'd like to group your columns, as the year is the same in all rows. But you can use a simple groupBy with a union to achieve that:

from pyspark.sql.functions import col, grouping_id, lit, sum

# calculate sub totals grouping by COLB
sub_total_by_colb = df.groupBy(col("COLB")).agg(sum("COLC").alias("COLC")) \
    .select(lit(None), col("COLB"), col("COLC"))

# union original df with sub total df
df.union(sub_total_by_colb) \
    .sort(*[col("COLB").asc_nulls_last(), col("COLA").asc_nulls_last()])\
    .show()
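
If you also want the literal SUB_TOTAL label from the question instead of null, one option is to relabel the sub-total rows (identified by their null COLA) after sorting. Here is a minimal sketch of that idea, assuming the same df as above; the when-based relabelling is my addition, not part of the original answer:

from pyspark.sql.functions import col, lit, when
from pyspark.sql.functions import sum as sum_

# sub totals per COLB; the null COLA marks a row as a sub total
sub = df.groupBy("COLB").agg(sum_("COLC").alias("COLC")) \
    .select(lit(None).cast("int").alias("COLA"), col("COLB"), col("COLC"))

# sort so each sub total lands under its group (nulls last within COLB),
# then swap in the "SUB_TOTAL" label on the marked rows
df.union(sub) \
    .sort(col("COLB").asc(), col("COLA").asc_nulls_last()) \
    .withColumn("COLB", when(col("COLA").isNull(), lit("SUB_TOTAL")).otherwise(col("COLB"))) \
    .show()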

Output:

+----+----+----+
|COLA|COLB|COLC|
+----+----+----+
|2019|   A|  10|
|2019|   A|  20|
|null|   A|30.0|
|2019|   B|  40|
|2019|   B|  30|
|null|   B|70.0|
+----+----+----+

That said, I would suggest you have a look at multi-dimensional aggregations; they are a better fit when you want to calculate sub totals.

Example using rollup:

df.rollup(col("COLA"), col("COLB")) \
 .agg(sum("COLC").alias("COLC"), grouping_id().alias("gid")) \
 .sort(*[col("COLA").desc_nulls_last(), col("COLB").asc_nulls_last()]) \
 .filter(grouping_id() != 3) \
 .select(col("COLA"), col("COLB"), col("COLC"))\
 .show()
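
To turn the rollup output into the exact layout from the question, the null grouping values can be relabelled with coalesce. A minimal sketch building on the example above; the coalesce relabelling and filtering on the gid alias are my additions, and note that the sub total here is per COLA, since that is how the rollup is keyed:

from pyspark.sql.functions import coalesce, col, grouping_id, lit
from pyspark.sql.functions import sum as sum_

# same rollup as above, keeping the grouping_id alias so the
# grand-total row (gid == 3) can be dropped, then relabelling the
# null COLB that rollup emits for the sub-total row
df.rollup(col("COLA"), col("COLB")) \
    .agg(sum_("COLC").alias("COLC"), grouping_id().alias("gid")) \
    .filter(col("gid") != 3) \
    .withColumn("COLB", coalesce(col("COLB"), lit("SUB_TOTAL"))) \
    .sort(col("COLA").desc_nulls_last(), col("COLB").asc_nulls_last()) \
    .select("COLA", "COLB", "COLC") \
    .show()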

Upvotes: 1
