scoder

Reputation: 2611

Spark copy the top N selected Rows to a new data frame

I have a data frame with millions of rows. I need to select all rows belonging to the top 100 ids into another data frame.

I know how to get the top column_ids

 df.groupBy("some_column_id").count()

This returns the column ids and their counts; now I need to filter the rows of these top 100 ids into another data frame.

My sample table is below. Since users 123 and 678 have more rows than the others, I want those two in a separate data frame.

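(The original screenshot is no longer available. Below is a minimal stand-in built in Scala, assuming a hypothetical column name "some_column_id" and a dummy "value" column.)

import spark.implicits._   // spark is the SparkSession

// hypothetical sample data: users 123 and 678 have more rows than the others
val df = Seq(
  (123, "a"), (123, "b"), (123, "c"),
  (678, "x"), (678, "y"),
  (555, "p"),
  (999, "q")
).toDF("some_column_id", "value")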

How can I do this?

Upvotes: 0

Views: 2805

Answers (2)

Chandan Ray

Reputation: 2091

Try a left-semi join:

import org.apache.spark.sql.functions.desc

// top 100 ids by row count, then keep only their rows via a left-semi join
val c = df.groupBy("student id").count.orderBy(desc("count")).limit(100)
val a = df.join(c, df.col("student id") === c.col("student id"), "leftsemi")
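The left semi join returns only the columns of df, keeping the rows whose "student id" appears in c. Since the list of top ids is small (at most 100 values), an alternative sketch, assuming the same df and column name, is to collect the ids to the driver and filter with isin:

import org.apache.spark.sql.functions.{col, desc}

// collect the (at most 100) top ids to the driver
val topIds = df.groupBy("student id").count.orderBy(desc("count")).limit(100)
  .select("student id").collect().map(_.get(0))

// keep only the rows whose id is in that small list
val a2 = df.filter(col("student id").isin(topIds: _*))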

Upvotes: 2

Raphael Roth

Reputation: 27373

You can do this with window functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val newDF = df
  // per-id row count
  .withColumn("count", count("*").over(Window.partitionBy("student_id")))
  // dense_rank gives consecutive ranks per distinct count value; plain rank()
  // would jump by the number of tied rows and quickly exceed 100
  .withColumn("rank", dense_rank().over(Window.orderBy(col("count").desc)))
  .where(col("rank") <= 100)
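If the helper columns are not needed afterwards, they can be dropped from newDF:

// remove the helper columns once the filtering is done
val result = newDF.drop("count", "rank")

Note that Window.orderBy without a partitionBy moves all rows into a single partition, so the join-based approach in the other answer may scale better on millions of rows.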

Upvotes: 0
