Extracting specific rows of grouped items conditionally in R

Question

I have the dataset below which is representative of a much larger dataset I am working with. The dataset below has 3 columns: a 'weight','feed',and 'flag'. Weight is a numerical value, feed is a factor variable, and flag is my indicator variable. Flag takes on a value of 0,1, or 2 (which will ultimately be random, but please use values provided below).

For what factor category in the 'feed' field, I want to examine the 'flag' field. If the 'flag' value = 2 & for the first occurrence of that value of 2 then I want to take the next 2 consecutive rows for that feed category - if 2 occurs in the last row for a particular feed category, then nothing happens.

For example, row 2 of the data frame where horsebean =160, the flag value is 2. I would mark row 3 & 4 since these are the next two rows that follow an occurrence of the value 2 for horsebean. I would need to be careful though in the event that horsebean flag value of 2 didn't occur until row 9 or 10 - I wouldn't want to accidentally select rows from linseed.

Ideally a 4th column is created with some sort of flag indicating they were the rows that followed a flag value of 2.

If anyone could help me formulate or attack this problem I would be eternally grateful.

df =    structure(list(weight = c(179, 160, 136, 227, 217, 168, 108, 
  124, 143, 140, 309, 229, 181, 141, 260, 203, 148, 169, 213, 257, 
  244, 271, 243, 230, 248, 327, 329, 250, 193, 271, 316, 267, 199, 
 171, 158, 248, 423, 340, 392, 339, 341, 226, 320, 295, 334, 322, 
 297, 318, 325, 257, 303, 315, 380, 153, 263, 242, 206, 344, 258, 
368, 390, 379, 260, 404, 318, 352, 359, 216, 222, 283, 332), 
feed = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), .Label = c("casein", "horsebean", "linseed", "meatmeal", 
"soybean", "sunflower"), class = "factor"), flag = structure(c(2L, 
3L, 3L, 3L, 3L, 1L, 2L, 2L, 2L, 1L, 1L, 3L, 2L, 3L, 2L, 3L, 
3L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 
3L, 3L, 1L, 2L, 1L, 1L, 3L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 
2L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 
1L, 3L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 2L), .Label = c("0", 
"1", "2"), class = "factor")), .Names = c("weight", "feed", 
 "flag"), row.names = c(NA, -71L), class = "data.frame")

Also suppose that the flag values are randomly generated and the flag value of 2 may or may not appear for each grouped value. For example if

  df$flag = sample(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), replace=T, size=nrow(df))

David Arenburg · Accepted Answer

Here's a simple data.table solution

library(data.table)
indx <- setDT(df)[, .I[match(2, flag) + seq_len(2)], by = feed]$V1
df[indx, flag2 := 1][]
#     weight      feed flag flag2
#  1:    179 horsebean    1    NA
#  2:    160 horsebean    2    NA
#  3:    136 horsebean    2     1
#  4:    227 horsebean    2     1
#  5:    217 horsebean    2    NA
#  6:    168 horsebean    0    NA
#  7:    108 horsebean    1    NA
#  8:    124 horsebean    1    NA
#  9:    143 horsebean    1    NA
# 10:    140 horsebean    0    NA
# ...

Why this will work? Because we use match in order to match the first 2 within .I (index) within each group. Thus if the index exceeds the size of a certain group it will return NA. Eventually index is just the indexes of the rows we want to select. Thus while sub-setting by this index, we are assigning a new flag which will tell us which rows we want.

For example, If you want to select only the rows within flag2, simply use na.omit by the flag2 column

na.omit(df, "flag2")
#     weight      feed flag flag2
#  1:    136 horsebean    2     1
#  2:    227 horsebean    2     1
#  3:    181   linseed    1     1
#  4:    141   linseed    2     1
#  5:    329   soybean    1     1
#  6:    250   soybean    2     1
#  7:    392 sunflower    1     1
#  8:    339 sunflower    1     1
#  9:    257  meatmeal    2     1
# 10:    303  meatmeal    1     1
# 11:    390    casein    2     1
# 12:    379    casein    0     1

Extracting specific rows of grouped items conditionally in R

Answers (1)

Related Questions