Spark DataFrame: how to get the latest n rows using Java

Question

I am new to Spark. Right now I am trying to join two DataFrames together, and I want to keep the result at 5000 rows. Since my first DataFrame already has 5000 rows and my second DataFrame has 1000 rows, I need to keep only the latest 4000 rows of the first one. Can someone help me get a DataFrame containing the latest 4000 rows of the first DataFrame? Thanks in advance.

Ewan Leith · Accepted Answer

I'm not sure what you're really hoping to achieve this way, but if you're on Spark 1.5 you could do something like this using monotonicallyIncreasingId:

val df4000 = df.sort(monotonicallyIncreasingId().desc).limit(4000)

This sorts the rows in descending order by the ID generated for each row in the DataFrame, then limits the result to the first 4000 rows.

Otherwise you could do the same using any column that you know increases consistently.
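
Since the snippet above is Scala and the question asks about Java, here is a minimal sketch of the same idea (sort descending by an increasing ID, then keep the first n) in plain Java collections rather than the Spark API. The `Row` class, the `latest` helper and the sample data are hypothetical, made up for illustration only. With Spark's Java API the equivalent call would presumably be something like `df.sort(functions.monotonicallyIncreasingId().desc()).limit(4000)`, but that has not been tested here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class LatestRows {
    // Hypothetical stand-in for a DataFrame row: an increasing id plus a payload.
    static final class Row {
        final long id;
        final String value;

        Row(long id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    // Sort by id in descending order, then keep the first n rows (the "latest" n).
    static List<Row> latest(List<Row> rows, int n) {
        return rows.stream()
                .sorted((a, b) -> Long.compare(b.id, a.id))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Build 5000 sample rows with ids 0..4999, mirroring the first DataFrame.
        List<Row> first = new ArrayList<>();
        for (long i = 0; i < 5000; i++) {
            first.add(new Row(i, "row-" + i));
        }

        // Keep the latest 4000, leaving room for 1000 rows from the second DataFrame.
        List<Row> kept = latest(first, 4000);
        System.out.println(kept.size());                    // prints 4000
        System.out.println(kept.get(0).id);                 // prints 4999
        System.out.println(kept.get(kept.size() - 1).id);   // prints 1000
    }
}
```

The kept rows are the ones with the highest IDs, so this only gives the "latest" rows if the ID really does grow with insertion order.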
