TULSI

Reputation: 171

Group on one column and count on another column in a Spark DataFrame

The data looks like:

Store | Bookno
1     | 101
1     | 102
2     | 301
3     | 401
5     | 501
5     | 502

Expected output:

Store | Count
1     | 2
5     | 2

When grouping by Store, the output should contain only the stores that have more than one Bookno. I am trying to do this with a Spark DataFrame.
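For reference, a minimal sketch to reproduce this sample data in a Spark shell (the column names store and bookno are assumed):

import spark.implicits._

// Hypothetical sample data matching the table above
val df = Seq(
  (1, 101), (1, 102),
  (2, 301),
  (3, 401),
  (5, 501), (5, 502)
).toDF("store", "bookno")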

Upvotes: 0

Views: 618

Answers (2)

koiralo

Reputation: 23119

You can just use groupBy with count as the aggregate function, then filter to keep only the stores whose count is greater than 1:

import org.apache.spark.sql.functions.count
import spark.implicits._ // for the $"..." column syntax

df.groupBy("STORE")
  .agg(count("BookNo").as("count"))
  .filter($"count" > 1)
  .show(false)

Output:

+-----+-----+
|STORE|count|
+-----+-----+
|1    |2    |
|5    |2    |
+-----+-----+
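If the DataFrame is registered as a temporary view, the same aggregation can also be expressed in Spark SQL with a HAVING clause; a sketch, assuming a hypothetical view name books:

df.createOrReplaceTempView("books")
spark.sql("""
  SELECT STORE, count(BookNo) AS count
  FROM books
  GROUP BY STORE
  HAVING count(BookNo) > 1
""").show(false)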

Upvotes: 2

s.polam

Reputation: 10382

Use a window function. Check the code below.

scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._

scala> val expr = row_number().over(Window.partitionBy($"store").orderBy($"store".asc))

scala> df.withColumn("rid",expr).show(false)
+-----+------+---+
|store|bookno|rid|
+-----+------+---+
|1    |101   |1  |
|1    |102   |2  |
|3    |401   |1  |
|5    |501   |1  |
|5    |502   |2  |
|2    |301   |1  |
+-----+------+---+

scala> df.withColumn("rid",expr).filter($"rid" > 1).show(false)
+-----+------+---+
|store|bookno|rid|
+-----+------+---+
|1    |102   |2  |
|5    |502   |2  |
+-----+------+---+
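Note that this filter returns the surplus rows themselves rather than per-store counts. To get the expected store/count pairs with a window, one sketch (assuming count is imported from org.apache.spark.sql.functions) is:

import org.apache.spark.sql.functions.count

// Count rows within each store's partition, then keep stores with more than one row
val cnt = count($"bookno").over(Window.partitionBy($"store"))
df.withColumn("count", cnt)
  .filter($"count" > 1)
  .select($"store", $"count")
  .distinct
  .show(false)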

Or use groupBy.

scala> df.groupBy($"store").count.show(false)
+-----+-----+
|store|count|
+-----+-----+
|1    |2    |
|3    |1    |
|5    |2    |
|2    |1    |
+-----+-----+


scala> df.groupBy($"store").count.filter($"count" > 1).show(false)
+-----+-----+
|store|count|
+-----+-----+
|1    |2    |
|5    |2    |
+-----+-----+

Upvotes: 2
