Reputation: 37

Choose top n variables in R when matching values

I have a large time series dataset and would like to choose the top 10 observations from each date, based on the values in one of my columns.

I am able to do this using group_by(Date) %>% top_n(10)

However, if the values for the 10th and 11th observations are equal, both are picked, so I get 11 observations instead of 10.

Does anyone know what I can do to make sure that only 10 observations are chosen?

Upvotes: 0

Answers (3)

Reputation: 887241

We can use base R. Sort by Date and by descending value, then number the rows within each Date and keep the first 10:

df1 <- df[with(df, order(Date, -value)), ]
df1[with(df1, ave(seq_along(Date), Date, FUN = seq_along)) <= 10, ]
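
As a minimal sketch of the sort-then-number approach, here is a self-contained run on toy data. The data frame, the column names Date and value, and the cutoff n = 2 (standing in for 10) are all assumptions made for illustration:

```r
# Hypothetical toy data: the first date has a tie (3, 3) at the cutoff.
df <- data.frame(
  Date  = as.Date(rep(c("2020-01-01", "2020-01-02"), times = c(4, 2))),
  value = c(5, 3, 3, 1, 7, 2)
)
n <- 2  # rows to keep per date (10 in the question)

# Sort by date, then by value in descending order.
df1 <- df[order(df$Date, -df$value), ]

# Number the rows within each date and keep only the first n.
rank_in_group <- ave(seq_along(df1$Date), df1$Date, FUN = seq_along)
top <- df1[rank_in_group <= n, ]
print(top)
```

Because the cut is made on row position rather than on the value itself, a tie at the boundary cannot add extra rows.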

Upvotes: 0

Reputation: 6226

With data.table you can do

library(data.table)
setDT(df)
df[order(Date, -value)][, .SD[1:10], by = Date]

Replace value with the name of the column you rank by. Note that .SD[1:10] returns NA rows for any Date that has fewer than 10 observations. Using head() avoids this:

df[order(Date, -value)][, head(.SD, 10), by = Date]
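
To compare how indexing .SD and calling head() behave when a date has fewer rows than requested, here is a small sketch. The table dt, its column names, and the cutoff n = 2 are hypothetical, and the data.table package is assumed to be installed:

```r
library(data.table)

# Hypothetical toy data: the second date has only one row.
dt <- data.table(
  Date  = as.Date(rep(c("2020-01-01", "2020-01-02"), times = c(4, 1))),
  value = c(5, 3, 3, 1, 7)
)
n <- 2  # rows to keep per date (10 in the question)

# Indexing .SD past the end of a group yields NA rows.
padded <- dt[order(Date, -value)][, .SD[1:n], by = Date]

# head() returns at most n rows and never pads.
top <- dt[order(Date, -value)][, head(.SD, n), by = Date]

print(padded)
print(top)
```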

Upvotes: 0

Reputation: 389055

You can arrange the data and select the first 10 rows in each group:

library(dplyr)
df %>% arrange(Date, desc(col_name)) %>% group_by(Date) %>% slice(1:10)

Similarly, with filter:

df %>% 
 arrange(Date, desc(col_name)) %>% 
 group_by(Date) %>% 
 filter(row_number() <= 10)
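
Another option, not part of the answer above, is slice_max() with with_ties = FALSE. This sketch assumes dplyr 1.0.0 or later; the toy data, the column name value, and n = 2 (standing in for 10) are hypothetical:

```r
library(dplyr)

# Hypothetical toy data: the first date has a tie (3, 3) at the cutoff.
df <- data.frame(
  Date  = as.Date(rep(c("2020-01-01", "2020-01-02"), times = c(4, 2))),
  value = c(5, 3, 3, 1, 7, 2)
)

# Keep the n largest values per date; with_ties = FALSE caps each group at n rows.
top <- df %>%
  group_by(Date) %>%
  slice_max(value, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top)
```

This keeps the ranking and the tie handling in a single call, so no separate arrange() step is needed.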

Upvotes: 1