Reputation: 2611
I have a data frame with millions of rows, and I need to select all rows belonging to the top 100 ids into another data frame.
I know how to get the ids with their counts:
df.groupBy("some_column_id").count()
This returns the column ids and their counts; now I need to filter the rows of those top 100 ids into another data frame.
In my sample data, users 123 and 678 have more rows than the others, so I want those two users' rows in a separate data frame.
How can I do this?
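For example, sorting the counts and limiting gives me the top 100 ids (a minimal sketch, using the column name from my snippet above):

import org.apache.spark.sql.functions.desc

// ids with the most rows, limited to the top 100
df.groupBy("some_column_id").count().orderBy(desc("count")).limit(100)

But this only gives me the ids and their counts, not the original rows.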
Upvotes: 0
Views: 2805
Reputation: 2091
Try a left-semi join against the top-100 counts:

import org.apache.spark.sql.functions.desc

// top 100 ids by row count
val c = df.groupBy("student id").count.orderBy(desc("count")).limit(100)
val a = df.join(c, df.col("student id") === c.col("student id"), "leftsemi")

The left-semi join keeps every row of df whose id appears in c, without adding c's columns to the result.
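Since the list of top ids is small here (100), an alternative is to collect the ids to the driver and filter with isin. A sketch, assuming the ids fit comfortably in driver memory:

import org.apache.spark.sql.functions.{col, desc}

// collect just the 100 top ids locally (small: one value per id)
val topIds = df.groupBy("student id").count()
  .orderBy(desc("count")).limit(100)
  .select("student id")
  .collect()
  .map(_.get(0))
// filter the original data frame against that in-memory list
val filtered = df.filter(col("student id").isin(topIds: _*))

This avoids a join, at the cost of one extra action to materialize the id list.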
Upvotes: 2
Reputation: 27373
You can do this with window functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val newDF = df
  // per-id row count, attached to every row of that id
  .withColumn("count", count("*").over(Window.partitionBy("student_id")))
  // rank the ids by that count, highest first
  .withColumn("rank", rank().over(Window.orderBy(col("count").desc)))
  .where(col("rank") <= 100)
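Two caveats: Window.orderBy without a partitionBy pulls all rows into a single partition, which can be slow on millions of rows, and rank() assigns tied counts the same rank, so slightly more than 100 ids can pass the filter. If you want the result without the helper columns, drop them afterwards (column names as defined above):

val result = newDF.drop("count", "rank")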
Upvotes: 0