Reputation: 642
I have a Spark DataFrame as follows:
val someDF5 = Seq(
("202003101750", "202003101700",122),
("202003101800", "202003101700",12),
("202003101750", "202003101700",42),
("202003101810", "202003101700",2)
).toDF("number", "word","value")
With a column num_records
by doing the following:
val DF1 = someDF5.groupBy("number","word").agg(count("*").alias("num_records"))
DF1:
+------------+------------+-------------+
| number| word|num_records |
+------------+------------+-------------+
|202003101750|202003101700| 2|
|202003101800|202003101700| 1|
|202003101810|202003101700| 1|
+------------+------------+-------------+
How can I add another column say total_records
which keeps track of the total of num_records
and adds to the dataframe? For example, this is what I expect:
+------------+------------+-------------+-------------+--
| number| word|num_records |total_records |
+------------+------------+-------------+----------------
|202003101750|202003101700| 2| 4 |
|202003101800|202003101700| 1| 4 |
|202003101810|202003101700| 1| 4 |
+------------+------------+-------------+----------------
Note: total_records should keep updating/adding whenever num_records changes
Upvotes: 1
Views: 871
Reputation: 29165
add withColumn and count thats all..
val someDF5 = Seq(
("202003101750", "202003101700", 122),
("202003101800", "202003101700", 12),
("202003101750", "202003101700", 42),
("202003101810", "202003101700", 2)
).toDF("number", "word", "value")
val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
.withColumn("total_records",lit(someDF5.count))
DF1.show
Result :
+------------+------------+-----------+-------------+
| number| word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700| 2| 4|
|202003101800|202003101700| 1| 4|
|202003101810|202003101700| 1| 4|
+------------+------------+-----------+-------------+
number of records increased like this count is automatically updated.
val someDF5 = Seq(
("202003101750", "202003101700", 122),
("202003101800", "202003101700", 12),
("202003101750", "202003101700", 42),
("202003101810", "202003101700", 2),
("202003101810", "22222222", 222)
).toDF("number", "word", "value")
val DF1 = someDF5.groupBy("number", "word").agg(count("*").alias("num_records"))
.withColumn("total_records",lit(someDF5.count))
Result :
+------------+------------+-----------+-------------+
| number| word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101750|202003101700| 2| 5|
|202003101800|202003101700| 1| 5|
|202003101810|202003101700| 1| 5|
|202003101810| 22222222| 1| 5|
+------------+------------+-----------+-------------+
Upvotes: 2
Reputation: 4481
I think you can do it creating new dataframe with sum:
val total = DF1.agg(sum(col("num_records"))).head().getAs[Long](0)
val dfWithTotal = DF1.withColumn("total_records", lit(total))
dfWithTotal.show()
+------------+------------+-----------+-------------+
| number| word|num_records|total_records|
+------------+------------+-----------+-------------+
|202003101810|202003101700| 1| 4|
|202003101750|202003101700| 2| 4|
|202003101800|202003101700| 1| 4|
+------------+------------+-----------+-------------+
Upvotes: 1