Reputation: 37
I have a large time series dataset and would like to select the top 10 observations for each date, based on the values in one of my columns.
I am able to do this using group_by(Date) %>% top_n(10)
However, if the values for the 10th and 11th observation are equal, then they are both picked, so that I get 11 observations instead of 10.
Does anyone know what I can do to make sure that only 10 observations are chosen?
Upvotes: 0
Views: 161
Reputation: 887241
We can use base R
# Sort by Date, then by descending value
df1 <- df[with(df, order(Date, -value)),]
# Keep only the first 10 rows within each Date
df1[with(df1, ave(seq_along(Date), Date, FUN = seq_along)) <= 10,]
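To check that a tie at the cutoff no longer produces an 11th row, the same idea can be run on a small made-up dataset. This is only a sketch: the data frame below is hypothetical, and the column names Date and value are assumed to match the ones used above.

```r
# Hypothetical toy data: 2 dates, 12 observations each.
# The 10th, 11th and 12th largest values are tied (all equal to 3).
df <- data.frame(
  Date  = rep(as.Date(c("2020-01-01", "2020-01-02")), each = 12),
  value = rep(c(12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 3, 3), times = 2)
)

# Sort by date, then by descending value.
df1 <- df[order(df$Date, -df$value), ]

# Row number within each date (restarts at 1 for every new date).
rank_in_date <- ave(seq_along(df1$Date), df1$Date, FUN = seq_along)

# Keep the first 10 rows per date; ties at the cutoff are broken by row order.
top10 <- df1[rank_in_date <= 10, ]

# Number of rows kept per date.
table(top10$Date)
```

Because the filter works on row position rather than on the values themselves, tied values at the 10th position cannot add extra rows.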
Upvotes: 0
Reputation: 6226
With data.table you can do
library(data.table)
setDT(df)
df[order(Date, -value)][, .SD[1:10], by = Date]
Change value to match the name of the variable used to choose which observations should be kept. You can also do the following, which avoids the NA rows that .SD[1:10] returns for dates with fewer than 10 observations:
df[order(Date, -value)][, head(.SD, 10), by = Date]
Upvotes: 0
Reputation: 389055
You can arrange the data and select the first 10 rows in each group.
library(dplyr)
df %>% arrange(Date, desc(col_name)) %>% group_by(Date) %>% slice(1:10)
Similarly, with filter
df %>%
  arrange(Date, desc(col_name)) %>%
  group_by(Date) %>%
  filter(row_number() <= 10)
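As an alternative sketch, assuming dplyr 1.0.0 or later is installed, slice_max() with with_ties = FALSE should do the same in one step. The toy data frame and the column name value below are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: 2 dates, 12 observations each,
# with a tie at the 10th/11th/12th position (all equal to 3).
df <- data.frame(
  Date  = rep(as.Date(c("2020-01-01", "2020-01-02")), each = 12),
  value = rep(c(12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 3, 3), times = 2)
)

top10 <- df %>%
  group_by(Date) %>%
  # with_ties = FALSE returns exactly n rows per group, even if values are tied
  slice_max(order_by = value, n = 10, with_ties = FALSE) %>%
  ungroup()

# Number of rows kept per date.
count(top10, Date)
```

With the default with_ties = TRUE, slice_max() keeps tied rows just like top_n(), so the argument has to be set explicitly.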
Upvotes: 1