Reputation: 6615
I have the dataset below which is representative of a much larger dataset I am working with. The dataset below has 3 columns: a 'weight','feed',and 'flag'. Weight is a numerical value, feed is a factor variable, and flag is my indicator variable. Flag takes on a value of 0,1, or 2 (which will ultimately be random, but please use values provided below).
For what factor category in the 'feed' field, I want to examine the 'flag' field. If the 'flag' value = 2 & for the first occurrence of that value of 2 then I want to take the next 2 consecutive rows for that feed category - if 2 occurs in the last row for a particular feed category, then nothing happens.
For example, row 2 of the data frame where horsebean =160, the flag value is 2. I would mark row 3 & 4 since these are the next two rows that follow an occurrence of the value 2 for horsebean. I would need to be careful though in the event that horsebean flag value of 2 didn't occur until row 9 or 10 - I wouldn't want to accidentally select rows from linseed.
Ideally a 4th column is created with some sort of flag indicating they were the rows that followed a flag value of 2.
If anyone could help me formulate or attack this problem I would be eternally grateful.
df = structure(list(weight = c(179, 160, 136, 227, 217, 168, 108,
124, 143, 140, 309, 229, 181, 141, 260, 203, 148, 169, 213, 257,
244, 271, 243, 230, 248, 327, 329, 250, 193, 271, 316, 267, 199,
171, 158, 248, 423, 340, 392, 339, 341, 226, 320, 295, 334, 322,
297, 318, 325, 257, 303, 315, 380, 153, 263, 242, 206, 344, 258,
368, 390, 379, 260, 404, 318, 352, 359, 216, 222, 283, 332),
feed = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("casein", "horsebean", "linseed", "meatmeal",
"soybean", "sunflower"), class = "factor"), flag = structure(c(2L,
3L, 3L, 3L, 3L, 1L, 2L, 2L, 2L, 1L, 1L, 3L, 2L, 3L, 2L, 3L,
3L, 1L, 2L, 3L, 1L, 2L, 2L, 1L, 2L, 3L, 2L, 3L, 2L, 2L, 2L,
3L, 3L, 1L, 2L, 1L, 1L, 3L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L,
2L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 3L, 3L,
1L, 3L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 2L), .Label = c("0",
"1", "2"), class = "factor")), .Names = c("weight", "feed",
"flag"), row.names = c(NA, -71L), class = "data.frame")
Also suppose that the flag values are randomly generated and the flag value of 2 may or may not appear for each grouped value. For example if
df$flag = sample(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20), replace=T, size=nrow(df))
Upvotes: 0
Views: 139
Reputation: 92282
Here's a simple data.table
solution
library(data.table)
indx <- setDT(df)[, .I[match(2, flag) + seq_len(2)], by = feed]$V1
df[indx, flag2 := 1][]
# weight feed flag flag2
# 1: 179 horsebean 1 NA
# 2: 160 horsebean 2 NA
# 3: 136 horsebean 2 1
# 4: 227 horsebean 2 1
# 5: 217 horsebean 2 NA
# 6: 168 horsebean 0 NA
# 7: 108 horsebean 1 NA
# 8: 124 horsebean 1 NA
# 9: 143 horsebean 1 NA
# 10: 140 horsebean 0 NA
# ...
Why this will work? Because we use match
in order to match the first 2
within .I
(index) within each group. Thus if the index exceeds the size of a certain group it will return NA
. Eventually index
is just the indexes of the rows we want to select. Thus while sub-setting by this index, we are assigning a new flag which will tell us which rows we want.
For example, If you want to select only the rows within flag2
, simply use na.omit
by the flag2
column
na.omit(df, "flag2")
# weight feed flag flag2
# 1: 136 horsebean 2 1
# 2: 227 horsebean 2 1
# 3: 181 linseed 1 1
# 4: 141 linseed 2 1
# 5: 329 soybean 1 1
# 6: 250 soybean 2 1
# 7: 392 sunflower 1 1
# 8: 339 sunflower 1 1
# 9: 257 meatmeal 2 1
# 10: 303 meatmeal 1 1
# 11: 390 casein 2 1
# 12: 379 casein 0 1
Upvotes: 3