Laodao

Reputation: 1709

Spark DataFrame: how to get the latest n rows using Java

I am new to Spark. Right now I am trying to join two DataFrames together, and I want to keep the result at 5000 rows. My first DataFrame already has 5000 rows and my second DataFrame has 1000 rows, so I need to keep only the latest 4000 rows of the first one. Can someone help me get a DataFrame with the latest 4000 rows of the first DataFrame? Thanks in advance.

Upvotes: 0

Views: 3200

Answers (2)

Quentin

Reputation: 3290

Starting with Spark 3.4, you can use OFFSET to skip the first N rows. In your case, you could count the total number of rows and then skip the leading ones, something like this:

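val count = df.count()
df.createOrReplaceTempView("your_df")  // register the DataFrame so SQL can query it
val last4000 = spark.sql(s"SELECT * FROM your_df OFFSET ${math.max(count - 4000, 0L)}")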

Note that the result is not deterministic unless the query has an ORDER BY clause, because Spark does not guarantee row order otherwise.
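Since the question asks for Java, here is a minimal sketch of just the offset arithmetic, guarded so that a DataFrame with fewer rows than requested yields an offset of 0 rather than a negative number. The class and method names (TailOffset, tailOffset) are made up for illustration, and the Spark call appears only in a comment, assuming Spark 3.4+ and a hypothetical ordering column.

```java
final class TailOffset {
    /** Number of leading rows to skip so that at most `keep` trailing rows remain. */
    static long tailOffset(long totalRows, long keep) {
        if (keep < 0) {
            throw new IllegalArgumentException("keep must be non-negative");
        }
        // Never return a negative offset when the table is smaller than `keep`.
        return Math.max(totalRows - keep, 0L);
    }

    public static void main(String[] args) {
        // Numbers from the question: cap of 5000 rows, second DataFrame has 1000.
        long keep = 5000 - 1000;
        long offset = tailOffset(5000, keep);
        System.out.println(offset); // prints 1000

        // With Spark 3.4+, assuming a Dataset<Row> named df and a hypothetical
        // column "some_ordering_column" that defines what "latest" means:
        //   Dataset<Row> last = df.orderBy("some_ordering_column").offset((int) offset);
    }
}
```

The explicit ordering in the commented call matters for the same reason as the ORDER BY remark above: without it, which rows get skipped is not guaranteed.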

Upvotes: 0

Ewan Leith

Reputation: 1665

I'm not sure what you're really hoping to achieve this way, but if you're on Spark 1.5 you could do something like this using monotonicallyIncreasingId (named monotonically_increasing_id in later releases):

val df4000 = df.sort(monotonicallyIncreasingId().desc).limit(4000)

This sorts the DataFrame in descending order by the ID generated for each row, then limits the result to the first 4000 rows.

Otherwise you could do the same using any column that you know increases consistently.
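To make the "sort descending, then limit" idea concrete in Java, here is a small in-memory sketch using plain collections rather than Spark. The Row class and its id field are hypothetical stand-ins for a DataFrame row and for whatever consistently increasing column you have. It only illustrates the logic and is not a Spark API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class LatestRows {
    /** Hypothetical row type: `id` stands in for any consistently increasing column. */
    static final class Row {
        final long id;
        final String value;

        Row(long id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    /** Returns up to n rows with the largest ids, newest first. */
    static List<Row> latest(List<Row> rows, int n) {
        return rows.stream()
                // Sort by id in descending order, like sort(col.desc) on a DataFrame.
                .sorted(Comparator.comparingLong((Row r) -> r.id).reversed())
                // Keep only the first n rows of the sorted result, like limit(n).
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Note that the result comes back newest first. If the original order is needed, the kept rows have to be sorted ascending again afterwards.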

Upvotes: 3
