Reputation: 79
val someDF = Seq(
(4623874, "user1", "success"),
(4623874, "user2","fail"),
(4623874, "user3","success"),
(1343244, "user4","fail"),
(4235252, "user5", "fail")
).toDF("primaryid", "user","status")
This is the input data frame is it possible to get the count status for each primary id other than groupby
someDF.groupBy("primaryid", "status").count.show
+-------+-------+-----+
primaryid| status|count|
+-------+-------+-----+
|4235252| fail| 1|
|1343244| fail| 1|
|4623874| fail| 1|
|4623874|success| 2|
+-------+-------+-----+
Any other way to get the above result other than "groupby" ?
Upvotes: 0
Views: 728
Reputation: 10382
Use count
window function. Check below code.
scala> val someDF = Seq(
| (4623874, "user1", "success"),
| (4623874, "user2","fail"),
| (4623874, "user3","success"),
| (1343244, "user4","fail"),
| (4235252, "user5", "fail")
| ).toDF("primaryid", "user","status")
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
someDF
.withColumn("count",
count($"status")
.over(
Window
.partitionBy($"primaryid",$"status")
.orderBy($"primaryid".asc)
)
).show(false)
+---------+-----+-------+-----+
|primaryid|user |status |count|
+---------+-----+-------+-----+
|4235252 |user5|fail |1 |
|1343244 |user4|fail |1 |
|4623874 |user2|fail |1 |
|4623874 |user1|success|2 |
|4623874 |user3|success|2 |
+---------+-----+-------+-----+
scala> :paste
// Entering paste mode (ctrl-D to finish)
someDF
.withColumn("count",
count($"status")
.over(
Window
.partitionBy($"primaryid",$"status")
.orderBy($"primaryid".asc)
)
)
.filter($"status" === "success")
.show(false)
// Exiting paste mode, now interpreting.
+---------+-----+-------+-----+
|primaryid|user |status |count|
+---------+-----+-------+-----+
|4623874 |user1|success|2 |
|4623874 |user3|success|2 |
+---------+-----+-------+-----+
Upvotes: 1