Reputation: 854
Suppose I have a data frame that looks like this:
ID T X Y Z
1 1 A A NA
1 2 B A NA
1 3 B B NA
1 4 B A NA
2 1 A B NA
2 2 A A NA
2 3 B A NA
2 4 A B NA
3 1 B B NA
3 2 B B NA
3 3 B B NA
3 4 B A NA
And I would like to replace the value of Z based on some conditionals that depend on both row and (previous) column values so that the above ends up looking like this:
ID T X Y Z
1 1 A A 0
1 2 B A 0
1 3 B B 1
1 4 B A NA
2 1 A B 0
2 2 A A 0
2 3 B A 0
2 4 A B 0
3 1 B B 1
3 2 B B NA
3 3 B B NA
3 4 B A NA
The rules:
I want the following to work, and it gets me kinda close but no dice:
df$Z <- NA
for (t in 1:4) {
df$Z[ (df$X=="B" & df$Y=="B") & df$T==1] <- 1
df$Z[!(df$X=="B" & df$Y=="B") & df$T==1] <- 0
if (t>1) {
df$Z[ (df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 0
df$Z[!(df$X=="B" & df$Y=="B") & df$T==t & (!is.na(df$Z[t-1]) & df$Z[t-1]==0)] <- 1
}
}
On the other hand, I can write a series of nested if... then statements looping across all observations, but that is excruciatingly slow (at least, compared to the program I am translating from in Stata).
I am sure I am committing twelve kinds of gaffes in my attempt above, but a few hours of banging my head on this has not resolved it.
So I come to you begging, hat in hand. :)
Edit: it occurs to me that sharing the Stata code (which resolves this so much faster than what I have come up with in R, which is ironic, given my preference for R over Stata's language :) might help with suggestions. This does what I want, and does it fast (even with, say, N=1600, T=11):
replace Z = .
forvalues t = 1(1)4 {
replace Z = 1 if X == "B" & Y == "B" & T == 1
replace Z = 0 if !(X == "B" & Y == "B") & T == 1
replace Z = 1 if X == "B" & Y == "B" & T == `t' & Z[_n-1] == 0 & `t' > 1
replace Z = 0 if !(X == "B" & Y == "B") & T == `t' & Z[_n-1] == 0 & `t' > 1
}
Upvotes: 3
Views: 2222
Reputation: 67778
Another possibility, using by:
ll <- by(df, df$ID, function(x){
x$Z <- cumsum(cumsum(x$X == "B" & x$Y == "B"))
x$Z[x$Z > 1] <- NA
x
})
df2 <- do.call(rbind, ll)
df2
# ID T X Y Z
# 1.1 1 1 A A 0
# 1.2 1 2 B A 0
# 1.3 1 3 B B 1
# 1.4 1 4 B A NA
# 2.5 2 1 A B 0
# 2.6 2 2 A A 0
# 2.7 2 3 B A 0
# 2.8 2 4 A B 0
# 3.9 3 1 B B 1
# 3.10 3 2 B B NA
# 3.11 3 3 B B NA
# 3.12 3 4 B A NA
Same function but using ddply instead:
library(plyr)
df2 <- ddply(.data = df, .variables = .(ID), function(x){
x$Z <- cumsum(cumsum(x$X == "B" & x$Y == "B"))
x$Z[x$Z > 1] <- NA
x
})
df2
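The double cumsum is the heart of both versions, so a minimal sketch of it on a single group may help. The vector below is the X == "B" & Y == "B" flag for ID 3 from the question; the names hit, first_pass, second_pass and z are only illustrative.

```r
# Flag for ID 3: TRUE where X == "B" & Y == "B"
hit <- c(TRUE, TRUE, TRUE, FALSE)

# First running sum: stays 0 until the first hit, then is at least 1
first_pass <- cumsum(hit)          # 1 2 3 3

# Second running sum: equals exactly 1 only at the first hit
second_pass <- cumsum(first_pass)  # 1 3 6 9

# Anything above 1 lies after the first hit, so blank it out
z <- second_pass
z[z > 1] <- NA
z
# [1]  1 NA NA NA
```

Rows before the first hit keep a running sum of 0, which is why they come out as 0 without any extra handling.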
Upvotes: 1
Reputation: 81713
Here's one approach using ave and transform:
transform(dat[order(dat$ID, dat$T), ],
Z = ave(X == "B" & Y == "B", ID, FUN = function(x) {
as.integer("is.na<-"(x, (duplicated(x) & cumsum(x)) |
c(0, diff(x)) < 0)) }))
# ID T X Y Z
# 1 1 1 A A 0
# 2 1 2 B A 0
# 3 1 3 B B 1
# 4 1 4 B A NA
# 5 2 1 A B 0
# 6 2 2 A A 0
# 7 2 3 B A 0
# 8 2 4 A B 0
# 9 3 1 B B 1
# 10 3 2 B B NA
# 11 3 3 B B NA
# 12 3 4 B A NA
where dat is the name of your data frame. The reordering (dat[order(dat$ID, dat$T), ]) is not necessary if the rows are already ordered along ID and T.
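For anyone who wants to try the answers in this thread, here is a sketch that rebuilds the example data from the question as dat and checks a per-ID result against the desired Z column. It uses ave with the double cumsum from the other answer rather than the is.na<- expression above, purely because it is shorter. It also assumes the rows are already ordered by ID and T. The names hit, z and expected are illustrative.

```r
# Rebuild the example data from the question (already ordered by ID, then T)
dat <- data.frame(
  ID = rep(1:3, each = 4),
  T  = rep(1:4, times = 3),
  X  = c("A", "B", "B", "B",  "A", "A", "B", "A",  "B", "B", "B", "B"),
  Y  = c("A", "A", "B", "A",  "B", "A", "A", "B",  "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# TRUE where both X and Y are "B"
hit <- dat$X == "B" & dat$Y == "B"

# Per ID: 0 before the first hit, 1 at the first hit, larger values afterwards
z <- ave(as.integer(hit), dat$ID, FUN = function(h) cumsum(cumsum(h)))

# Rows after the first hit are set to NA
z[z > 1] <- NA
dat$Z <- z

# Desired Z column from the question
expected <- c(0, 0, 1, NA,  0, 0, 0, 0,  1, NA, NA, NA)
stopifnot(isTRUE(all.equal(as.numeric(dat$Z), expected)))
```

If the stopifnot line passes silently, dat$Z matches the target table in the question.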
Upvotes: 2