Reputation: 55
I have a dataframe with USERID(thus we group by this), and other variables statuses and a date.
Some of these USERID's have more than 5 statuses, so we should keep only the 5 most recent ones,by date.
How should I code this, looks simple but I haven't manage to do so.
Upvotes: 0
Views: 69
Reputation: 887108
We can use data.table
. Convert the 'data.frame' to 'data.table' (setDT(df1)
), grouped by 'USERID', we order
the 'date' in decreasing (Assuming that the 'date' column is Date
class) and get the first 5 rows with head
library(data.table)
setDT(df1)[order(-date), head(.SD, 5), by=USERID]
Or as @Symbolix mentioned in the comments, we could also make use of .I
to get the row index and later remove the NA rows for groups that don't have 5 rows)
na.omit(setDT(df1)[df1[order(-date), .I[1:5], by= USERID]$V1 ])
set.seed(49)
df1 <- data.frame(USERID= sample(LETTERS[1:3], 12,
replace=TRUE), date= sample(seq(as.Date('2014-01-01'),
length.out=12, by = '1 day')))
Upvotes: 2
Reputation: 26258
If you're a fan of dplyr
you can do
library(dplyr)
df %>%
group_by(USERID) %>%
arrange(-date) %>%
slice(1:5) %>%
ungroup
On 'large' data sets the data.table
approach will likely be faster, but dplyr
has a slightly easier syntax to get your head around at first (in my opinion).
Upvotes: 2