Reputation: 2611
I have a data frame with millions of rows, and I need to select all rows belonging to the top 100 ids into another data frame.
I know how to get the ids with their counts:
df.groupBy("some_column_id").count()
This returns the column ids and their counts; now I need to filter the rows of those top 100 ids into another data frame.
In my sample data, users 123 and 678 have more rows than the others, so I want those two users' rows in a separate data frame.
How can I do this?
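For example, sorting the counts and limiting gives me the top 100 ids (a minimal sketch, using the column name from my snippet above):

import org.apache.spark.sql.functions.desc

// ids with the most rows, limited to the top 100
df.groupBy("some_column_id").count().orderBy(desc("count")).limit(100)

But this only gives me the ids and their counts, not the original rows.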
Upvotes: 0
Views: 2805
Reputation: 2091
Try a left-semi join against the top-100 counts:

import org.apache.spark.sql.functions.desc

// top 100 ids by row count
val c = df.groupBy("student id").count.orderBy(desc("count")).limit(100)
val a = df.join(c, df.col("student id") === c.col("student id"), "leftsemi")

The left-semi join keeps every row of df whose id appears in c, without adding c's columns to the result.
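Since the list of top ids is small here (100), an alternative is to collect the ids to the driver and filter with isin. A sketch, assuming the ids fit comfortably in driver memory:

import org.apache.spark.sql.functions.{col, desc}

// collect just the 100 top ids locally (small: one value per id)
val topIds = df.groupBy("student id").count()
  .orderBy(desc("count")).limit(100)
  .select("student id")
  .collect()
  .map(_.get(0))
// filter the original data frame against that in-memory list
val filtered = df.filter(col("student id").isin(topIds: _*))

This avoids a join, at the cost of one extra action to materialize the id list.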
Upvotes: 2
Reputation: 27373
You can do this with window functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val newDF = df
  // per-id row count, attached to every row of that id
  .withColumn("count", count("*").over(Window.partitionBy("student_id")))
  // rank the ids by that count, highest first
  .withColumn("rank", rank().over(Window.orderBy(col("count").desc)))
  .where(col("rank") <= 100)
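Two caveats: Window.orderBy without a partitionBy pulls all rows into a single partition, which can be slow on millions of rows, and rank() assigns tied counts the same rank, so slightly more than 100 ids can pass the filter. If you want the result without the helper columns, drop them afterwards (column names as defined above):

val result = newDF.drop("count", "rank")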
Upvotes: 0