Keep only first five rows if group has more than five rows

Question

I have a dataframe with USERID(thus we group by this), and other variables statuses and a date.

Some of these USERID's have more than 5 statuses, so we should keep only the 5 most recent ones,by date.

How should I code this, looks simple but I haven't manage to do so.

akrun · Accepted Answer

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'USERID', we order the 'date' in decreasing (Assuming that the 'date' column is Date class) and get the first 5 rows with head

library(data.table)
setDT(df1)[order(-date), head(.SD, 5), by=USERID]

Or as @Symbolix mentioned in the comments, we could also make use of .I to get the row index and later remove the NA rows for groups that don't have 5 rows)

 na.omit(setDT(df1)[df1[order(-date), .I[1:5], by= USERID]$V1 ])

data

set.seed(49)
df1 <- data.frame(USERID= sample(LETTERS[1:3], 12, 
  replace=TRUE), date= sample(seq(as.Date('2014-01-01'), 
  length.out=12, by = '1 day')))

Keep only first five rows if group has more than five rows

Answers (2)

data

Related Questions