Reputation: 171
Data looks like:
Store | Bookno
1     | 101
1     | 102
2     | 301
3     | 401
5     | 501
5     | 502
Expected output:
Store | Count
1     | 2
5     | 2
After grouping by Store, the output should contain only the stores that have more than one Bookno. I'm trying to do this with a Spark DataFrame.
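For reference, a minimal sketch that builds the sample data above as a DataFrame (assuming a spark-shell session; column names taken from the layout above):

import spark.implicits._  // provided automatically in spark-shell

// Sample data from the table above
val df = Seq(
  (1, 101), (1, 102),
  (2, 301),
  (3, 401),
  (5, 501), (5, 502)
).toDF("Store", "Bookno")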
Upvotes: 0
Views: 618
Reputation: 23119
You can just use groupBy with count as the aggregate function, then filter to keep only the groups whose count is greater than 1:
import org.apache.spark.sql.functions.count
import spark.implicits._  // for the $"..." column syntax

df.groupBy("STORE")                    // group rows by store
  .agg(count("BookNo").as("count"))    // count books per store
  .filter($"count" > 1)                // keep only stores with more than one book
  .show(false)
Output:
+-----+-----+
|STORE|count|
+-----+-----+
|1 |2 |
|5 |2 |
+-----+-----+
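The same logic can also be written in Spark SQL with a HAVING clause, if that reads more naturally (a sketch, assuming the DataFrame is first registered as a temp view; the view name is arbitrary):

df.createOrReplaceTempView("books")  // hypothetical view name
spark.sql("""
  SELECT STORE, count(BookNo) AS count
  FROM books
  GROUP BY STORE
  HAVING count(BookNo) > 1
""").show(false)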
Upvotes: 2
Reputation: 10382
Use a window function. Check the code below.
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> val expr = row_number().over(Window.partitionBy($"store").orderBy($"store".asc))
scala> df.withColumn("rid",expr).show(false)
+-----+------+---+
|store|bookno|rid|
+-----+------+---+
|1 |101 |1 |
|1 |102 |2 |
|3 |401 |1 |
|5 |501 |1 |
|5 |502 |2 |
|2 |301 |1 |
+-----+------+---+
scala> df.withColumn("rid",expr).filter($"rid" > 1).show(false)
+-----+------+---+
|store|bookno|rid|
+-----+------+---+
|1 |102 |2 |
|5 |502 |2 |
+-----+------+---+
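As a side note, if you want to keep every row of the qualifying stores rather than only the rows with rid > 1, the count itself can be computed as a window aggregate (a sketch using the same imports as above):

scala> val cnt = count($"bookno").over(Window.partitionBy($"store"))
scala> df.withColumn("cnt", cnt).filter($"cnt" > 1).show(false)

This keeps both rows for stores 1 and 5, each annotated with the per-store count of 2.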
Or use groupBy:
scala> df.groupBy($"store").count.show(false)
+-----+-----+
|store|count|
+-----+-----+
|1 |2 |
|3 |1 |
|5 |2 |
|2 |1 |
+-----+-----+
scala> df.groupBy($"store").count.filter($"count" > 1).show(false)
+-----+-----+
|store|count|
+-----+-----+
|1 |2 |
|5 |2 |
+-----+-----+
Upvotes: 2