How to take the first 5 rows by group without replacement in another variable's values

Question

I am trying to figure out how to take only the first 5 rows per group, without replacement in another variable's values. For example, suppose the existing data table (or data frame) looks like this:

I just want the first 5 rows for each group, but without replacement in V1 values across the groups: once a V1 value has been taken by one group, it should not appear in any later group. So the result table I want is:

I have been trying to do this with a for loop that goes through the ids one at a time, taking the first 5 rows for each id and excluding any rows whose V1 values were already taken by previous ids. But my data is really big (over a million ids), so the for loop takes forever to get through all of them.

Can anyone help me find a better, more efficient way to deal with this problem? Thanks much!

talat · Accepted Answer

Here's an option in three steps:

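# create a vector to store the V1 values already taken by earlier ids
x <- numeric()
# for each id, keep the first 5 values not yet taken and update x in the process
res <- lapply(split(df$V1, df$id), function(y) {
     y <- head(setdiff(y, x), 5)   # drop values used by earlier ids, keep at most 5
     x <<- union(x, y)             # record the newly taken values
     if(!length(y)) NA else y      # an id with nothing left gets a single NA
})
# combine the result into a data.frame
stack(res)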
#   values ind
#1     101   1
#2     102   1
#3     103   1
#4     104   1
#5     105   1
#6     107   2
#7     108   2
#8     109   2
#9     110   2
#10    111   2
#11     NA   3
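
To try this end to end, here is a minimal, self-contained sketch of the same split/setdiff/union idea. The example table from the question is not shown above, so the df below is invented toy data, and take_first_n is a hypothetical helper name, not something from the original answer. Wrapping the steps in a function keeps the vector of already-taken values local instead of in the global environment.

```r
# Hypothetical toy data: the ids and V1 values are invented for illustration.
df <- data.frame(
  id = c(rep(1, 6), rep(2, 7), rep(3, 2)),
  V1 = c(101:106, 101, 102, 107:111, 101, 107)
)

# Hypothetical helper wrapping the same approach as above.
take_first_n <- function(values, groups, n = 5) {
  seen <- numeric()                  # values already taken by earlier groups
  res <- lapply(split(values, groups), function(y) {
    y <- head(setdiff(y, seen), n)   # drop taken values, keep at most n
    seen <<- union(seen, y)          # updates `seen` inside take_first_n
    if (!length(y)) NA else y        # a group with nothing left becomes NA
  })
  stack(res)                         # columns: values, ind (the group id)
}

out <- take_first_n(df$V1, df$id)
print(out)
```

In this toy data, id 3 only has V1 values that ids 1 and 2 have already taken, so it ends up as a single NA row. Note that split() processes the groups in sorted order of id, so "earlier" here means a smaller id, not the id that appears first in the data.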
