Nicolas Vermeulen
Nicolas Vermeulen

Reputation: 55

Keep only first five rows if group has more than five rows

I have a dataframe with USERID(thus we group by this), and other variables statuses and a date.

Some of these USERID's have more than 5 statuses, so we should keep only the 5 most recent ones,by date.

How should I code this, looks simple but I haven't manage to do so.

Upvotes: 0

Views: 69

Answers (2)

akrun
akrun

Reputation: 887108

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'USERID', we order the 'date' in decreasing (Assuming that the 'date' column is Date class) and get the first 5 rows with head

library(data.table)
setDT(df1)[order(-date), head(.SD, 5), by=USERID]

Or as @Symbolix mentioned in the comments, we could also make use of .I to get the row index and later remove the NA rows for groups that don't have 5 rows)

 na.omit(setDT(df1)[df1[order(-date), .I[1:5], by= USERID]$V1 ])

data

set.seed(49)
df1 <- data.frame(USERID= sample(LETTERS[1:3], 12, 
  replace=TRUE), date= sample(seq(as.Date('2014-01-01'), 
  length.out=12, by = '1 day')))

Upvotes: 2

SymbolixAU
SymbolixAU

Reputation: 26258

If you're a fan of dplyr you can do

library(dplyr)

df %>%
  group_by(USERID) %>%
  arrange(-date) %>%
  slice(1:5) %>%
  ungroup

On 'large' data sets the data.table approach will likely be faster, but dplyr has a slightly easier syntax to get your head around at first (in my opinion).

Upvotes: 2

Related Questions