coderWorld
coderWorld

Reputation: 642

How do I sum a column and add the summed column to a Spark DataFrame?

I have a Spark DataFrame as follows:

val someDF5 = Seq(
  ("202003101750", "202003101700",122),
  ("202003101800", "202003101700",12),
  ("202003101750", "202003101700",42),
  ("202003101810", "202003101700",2)
).toDF("number", "word","value")

With a column num_records by doing the following:

val DF1 = someDF5.groupBy("number","word").agg(count("*").alias("num_records"))

DF1:

+------------+------------+-------------+                                       
|      number|        word|num_records  |
+------------+------------+-------------+
|202003101750|202003101700|            2|
|202003101800|202003101700|            1|
|202003101810|202003101700|            1|
+------------+------------+-------------+

How can I add another column say total_records which keeps track of the total of num_records and adds to the dataframe? For example, this is what I expect:

+------------+------------+-------------+-------------+--                                       
|      number|        word|num_records  |total_records  |
+------------+------------+-------------+----------------
|202003101750|202003101700|            2|             4 |
|202003101800|202003101700|            1|             4 |
|202003101810|202003101700|            1|             4 |
+------------+------------+-------------+----------------

Note: total_records should keep updating/adding whenever num_records changes

Upvotes: 1

Views: 871

Answers (2)

Ram Ghadiyaram
Ram Ghadiyaram

Reputation: 29165

add withColumn and count thats all..

val someDF5 = Seq(
    ("202003101750", "202003101700", 122),
    ("202003101800", "202003101700", 12),
    ("202003101750", "202003101700", 42),
    ("202003101810", "202003101700", 2)

  ).toDF("number", "word", "value")
  val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
    .withColumn("total_records",lit(someDF5.count))
  DF1.show

Result :

+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            4|
|202003101800|202003101700|          1|            4|
|202003101810|202003101700|          1|            4|
+------------+------------+-----------+-------------+

number of records increased like this count is automatically updated.

 val someDF5 = Seq(
    ("202003101750", "202003101700", 122),
    ("202003101800", "202003101700", 12),
    ("202003101750", "202003101700", 42),
    ("202003101810", "202003101700", 2),
      ("202003101810", "22222222", 222)
  ).toDF("number", "word", "value")
  val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
    .withColumn("total_records",lit(someDF5.count))

Result :

+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700|          2|            5|
|202003101800|202003101700|          1|            5|
|202003101810|202003101700|          1|            5|
|202003101810|    22222222|          1|            5|
+------------+------------+-----------+-------------+

Upvotes: 2

Boris Azanov
Boris Azanov

Reputation: 4481

I think you can do it creating new dataframe with sum:

val total = DF1.agg(sum(col("num_records"))).head().getAs[Long](0)
val dfWithTotal = DF1.withColumn("total_records", lit(total))
dfWithTotal.show()

+------------+------------+-----------+-------------+
|      number|        word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101810|202003101700|          1|            4|
|202003101750|202003101700|          2|            4|
|202003101800|202003101700|          1|            4|
+------------+------------+-----------+-------------+

Upvotes: 1

Related Questions