How to fill data frames in a manner dependent on values in other rows and columns in R

Question

Suppose I have a data frame that looks like this:

ID   T  X  Y  Z
1    1  A  A  NA
1    2  B  A  NA
1    3  B  B  NA
1    4  B  A  NA
2    1  A  B  NA
2    2  A  A  NA
2    3  B  A  NA
2    4  A  B  NA
3    1  B  B  NA
3    2  B  B  NA
3    3  B  B  NA
3    4  B  A  NA
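
For anyone who wants to reproduce this, here is a sketch of how the table above could be built. The column types are an assumption (X and Y are taken to be character vectors), and the name df is chosen only to match the attempt further down.

```r
# Build the example data frame shown above.
# Assumption: X and Y are plain character columns, not factors.
df <- data.frame(
  ID = rep(1:3, each = 4),
  T  = rep(1:4, times = 3),
  X  = c("A", "B", "B", "B",  "A", "A", "B", "A",  "B", "B", "B", "B"),
  Y  = c("A", "A", "B", "A",  "B", "A", "A", "B",  "B", "B", "B", "A"),
  Z  = NA,
  stringsAsFactors = FALSE
)
```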

And I would like to fill in Z using conditions that depend on the other columns in the same row and on the value of Z in earlier rows, so that the data frame ends up looking like this:

ID   T  X  Y  Z
1    1  A  A  0
1    2  B  A  0
1    3  B  B  1
1    4  B  A  NA
2    1  A  B  0
2    2  A  A  0
2    3  B  A  0
2    4  A  B  0
3    1  B  B  1
3    2  B  B  NA
3    3  B  B  NA
3    4  B  A  NA

The rules:

Z takes the value 1 the first time (in order of T, within an ID) that both X and Y on that row have the value B.
Z takes (or retains) the value NA if and only if it has already taken the value 1 at a smaller value of T within the same ID.
When T = 1, Z takes the value 0 if X and Y on that row do not both equal B.
When T > 1, Z takes the value 0 if X and Y on that row do not both equal B AND the value of Z on the previous row is 0.

I want the following to work, and it gets me kinda close but no dice:

df$Z <- NA
for (t in 1:4) {
  df$Z[ (df$X=="B" & df$Y=="B") & df$T==1] <- 1
  df$Z[!(df$X=="B" & df$Y=="B") & df$T==1] <- 0
  if (t>1) {
    df$Z[ (df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 0
    df$Z[!(df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 1
  }
}

On the other hand, I can write a series of nested if/else statements that loop over every observation, but that is excruciatingly slow (at least compared to the Stata program I am translating from).

I am sure I am committing twelve kinds of gaffes in my attempt above, but a few hours of banging my head on this has not resolved it.

So I come to you begging, hat in hand. :)

Edit: it occurs to me that sharing the Stata code (which resolves this so much faster than what I have come up with in R, which is ironic, given my preference for R over Stata's language :) might help with suggestions. This does what I want, and does it fast (even with, say, N=1600, T=11):

replace Z = .
forvalues t = 1(1)4 {
  replace Z = 1 if X == "B" & Y == "B" & T == 1
  replace Z = 0 if !(X == "B" & Y == "B") & T == 1
  replace Z = 1 if X == "B" & Y == "B" & T == `t' & Z[_n-1] == 0 & `t' > 1
  replace Z = 0 if !(X == "B" & Y == "B") & T == `t' & Z[_n-1] == 0 & `t' > 1
  }

Sven Hohenstein · Accepted Answer

Here's one approach using ave and transform:

transform(dat[order(dat$ID, dat$T), ],
          Z = ave(X == "B" & Y == "B", ID, FUN = function(x) {
            as.integer("is.na<-"(x, (duplicated(x) & cumsum(x)) | 
                                    c(0, diff(x)) < 0)) }))

#    ID T X Y  Z
# 1   1 1 A A  0
# 2   1 2 B A  0
# 3   1 3 B B  1
# 4   1 4 B A NA
# 5   2 1 A B  0
# 6   2 2 A A  0
# 7   2 3 B A  0
# 8   2 4 A B  0
# 9   3 1 B B  1
# 10  3 2 B B NA
# 11  3 3 B B NA
# 12  3 4 B A NA

where dat is the name of your data frame. The reordering (dat[order(dat$ID, dat$T), ]) is not necessary if the rows are already sorted by ID and then by T.
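
If the one-liner is hard to read, the same idea can be spelled out in smaller steps. The following is only a sketch, not the code from the answer: first_hit is a hypothetical helper name, and the data frame is rebuilt from the question's table on the assumption that X and Y are character columns. Within each ID it counts how many "both B" rows occurred on earlier rows; once that count is positive, Z becomes NA.

```r
# Example data from the question (column types are an assumption).
dat <- data.frame(
  ID = rep(1:3, each = 4),
  T  = rep(1:4, times = 3),
  X  = c("A", "B", "B", "B",  "A", "A", "B", "A",  "B", "B", "B", "B"),
  Y  = c("A", "A", "B", "A",  "B", "A", "A", "B",  "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: computes Z for one ID, given rows already sorted by T.
# both_b is a logical vector, TRUE where X and Y are both "B".
first_hit <- function(both_b) {
  hits_before <- cumsum(both_b) - both_b  # number of TRUEs on earlier rows
  z <- as.integer(both_b)                 # 1 where both are "B", otherwise 0
  z[hits_before > 0] <- NA                # everything after the first 1 is NA
  z
}

# Sort by ID and T, then apply the helper separately within each ID.
dat <- dat[order(dat$ID, dat$T), ]
dat$Z <- ave(dat$X == "B" & dat$Y == "B", dat$ID, FUN = first_hit)
```

Because cumsum and ave are vectorised over each group, this avoids an explicit loop over rows, which is where the nested-if approach tends to be slow.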
