Reputation: 97
I think it's time to ask for help. Suppose I have this data.frame or data.table:
State Date Event
CA Oct27 1
CA Oct28 0
CA Oct29 0
CA Oct30 0
CA Oct31 1
TX Oct27 0
TX Oct28 1
TX Oct29 1
TX Oct30 0
TX Oct31 0
TX Nov1 0
I want to create a new binary variable, "Active", that indicates whether there is an active event on a particular date in a particular state (assuming that all events last three days). The value "1" in column "Event" marks the day the event started. So, my data would look like this:
State Date Event Active
CA Oct27 1 1
CA Oct28 0 1
CA Oct29 0 1
CA Oct30 0 0
CA Oct31 1 1
TX Oct27 0 0
TX Oct28 1 1
TX Oct29 1 1
TX Oct30 0 1
TX Oct31 0 1
TX Nov1 0 0
I'd appreciate any suggestions.
Upvotes: 3
Views: 1744
Reputation: 28461
I like the data.table solution. Here's what I think is a cleaner base R solution.
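# One data frame per State
s <- split(df, df$State)
newlist <- lapply(s, function(x) {
  # Row positions of the two days following each event start
  days <- c(which(x$Event==1)+1, which(x$Event==1)+2)
  # Positions past the last row never match, so they are ignored
  x$Event[seq_along(x$Event) %in% days] <- 1
  x
})
do.call(rbind, newlist)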
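First, split the data frame by State. For each state, find the row positions of the two days after each event start. Any row whose position is in that list gets a 1. Lastly, bind the states back together. Note that this overwrites Event in place rather than adding a separate Active column.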
It outputs:
State Date Event
CA.1 CA Oct27 1
CA.2 CA Oct28 1
CA.3 CA Oct29 1
CA.4 CA Oct30 0
CA.5 CA Oct31 1
TX.6 TX Oct27 0
TX.7 TX Oct28 1
TX.8 TX Oct29 1
TX.9 TX Oct30 1
TX.10 TX Oct31 1
TX.11 TX Nov1 0
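If you'd rather leave Event untouched and get the Active column the question asks for, the same split/lapply idea can write to a new column instead. This is only a minimal sketch, assuming the df from the question with rows ordered by date and no gaps within each State; the variable name starts is mine.

```r
df <- data.frame(
  State = c("CA","CA","CA","CA","CA","TX","TX","TX","TX","TX","TX"),
  Date  = c("Oct27","Oct28","Oct29","Oct30","Oct31",
            "Oct27","Oct28","Oct29","Oct30","Oct31","Nov1"),
  Event = c(1,0,0,0,1,0,1,1,0,0,0)
)

s <- split(df, df$State)
newlist <- lapply(s, function(x) {
  starts <- which(x$Event == 1)              # rows where an event starts
  days   <- c(starts, starts + 1, starts + 2) # start day plus the two days after
  # Event stays as it was; Active is 1 on any covered row
  x$Active <- as.integer(seq_along(x$Event) %in% days)
  x
})
result <- do.call(rbind, newlist)
result
```

Because the match is on row position, a start on the last or second-to-last row of a State simply covers fewer rows.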
Upvotes: 1
Reputation: 35324
Dude, this was a seriously challenging problem. I think I got it using by() to group by State and Reduce() to repeatedly apply vectorized logical OR (|) to the Active vector, to account for any past day within the specified range (3) that had an event start.
df <- data.frame(State=c('CA','CA','CA','CA','CA','TX','TX','TX','TX','TX','TX'), Date=c('Oct27','Oct28','Oct29','Oct30','Oct31','Oct27','Oct28','Oct29','Oct30','Oct31','Nov1'), Event=c(1,0,0,0,1,0,1,1,0,0,0) );
E <- 3;
do.call(rbind,by(df,df$State,function(x) { s <- x$Event==1; x$Active <- Reduce(function(a,b) a|c(rep(F,b),s[-seq(length(s)-b+1,len=b)]),c(list(s),1:(E-1))); x; }));
## State Date Event Active
## CA.1 CA Oct27 1 TRUE
## CA.2 CA Oct28 0 TRUE
## CA.3 CA Oct29 0 TRUE
## CA.4 CA Oct30 0 FALSE
## CA.5 CA Oct31 1 TRUE
## TX.6 TX Oct27 0 FALSE
## TX.7 TX Oct28 1 TRUE
## TX.8 TX Oct29 1 TRUE
## TX.9 TX Oct30 0 TRUE
## TX.10 TX Oct31 0 TRUE
## TX.11 TX Nov1 0 FALSE
An advantage of this solution is that it parameterizes the event duration, which means you can change it easily in the future:
E <- 2;
do.call(rbind,by(df,df$State,function(x) { s <- x$Event==1; x$Active <- Reduce(function(a,b) a|c(rep(F,b),s[-seq(length(s)-b+1,len=b)]),c(list(s),1:(E-1))); x; }));
## State Date Event Active
## CA.1 CA Oct27 1 TRUE
## CA.2 CA Oct28 0 TRUE
## CA.3 CA Oct29 0 FALSE
## CA.4 CA Oct30 0 FALSE
## CA.5 CA Oct31 1 TRUE
## TX.6 TX Oct27 0 FALSE
## TX.7 TX Oct28 1 TRUE
## TX.8 TX Oct29 1 TRUE
## TX.9 TX Oct30 0 TRUE
## TX.10 TX Oct31 0 FALSE
## TX.11 TX Nov1 0 FALSE
The correctness of this solution depends on two assumptions, which must hold independently for each unique State: (1) there are no gaps in the Date sequence, and (2) the data.frame is ordered by Date.
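Both assumptions can be checked up front before relying on row positions. Here is a rough sketch, assuming Date has already been converted to class Date; the data frame name df_dates, the year 2015, and the helper name is_gapless_sorted are just for illustration.

```r
df_dates <- data.frame(
  State = c(rep("CA", 5), rep("TX", 6)),
  Date  = c(as.Date("2015-10-27") + 0:4, as.Date("2015-10-27") + 0:5),
  Event = c(1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0)
)

# TRUE only if every row is exactly one day after the previous one,
# which implies both "ordered by Date" and "no gaps".
is_gapless_sorted <- function(d) all(diff(as.integer(d)) == 1)

# One check per State
ok <- sapply(split(df_dates$Date, df_dates$State), is_gapless_sorted)
ok

# Dropping the second row (CA, Oct 28) introduces a gap for CA only.
ok_gap <- sapply(split(df_dates$Date[-2], df_dates$State[-2]), is_gapless_sorted)
ok_gap
```

If any State fails the check, a position-based approach is likely to give wrong results for that State.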
Here's a different solution using by() again, but now with seq() to generate all dates covered by an event, and merge() to merge those dates back into the data.frame subset for a particular State to set Active to true. This solution relaxes both of the assumptions I mentioned above; the input data.frame no longer has to be gapless or ordered. However, you must now coerce the Date column to class Date (as done in my demo below), although I would argue that's something that should always be done when you're working with dates.
df2 <- transform(df,Date=as.Date(Date,'%b%d'));
E <- 3;
transform(do.call(rbind,by(df2,df2$State,function(x) merge(x,data.frame(Date=unique(do.call(c,lapply(x$Date[x$Event==1],seq,by=1,len=E))),Active=T),all.x=T))),Active=replace(Active,is.na(Active),F));
## Date State Event Active
## CA.1 2015-10-27 CA 1 TRUE
## CA.2 2015-10-28 CA 0 TRUE
## CA.3 2015-10-29 CA 0 TRUE
## CA.4 2015-10-30 CA 0 FALSE
## CA.5 2015-10-31 CA 1 TRUE
## TX.1 2015-10-27 TX 0 FALSE
## TX.2 2015-10-28 TX 1 TRUE
## TX.3 2015-10-29 TX 1 TRUE
## TX.4 2015-10-30 TX 0 TRUE
## TX.5 2015-10-31 TX 0 TRUE
## TX.6 2015-11-01 TX 0 FALSE
E <- 2;
transform(do.call(rbind,by(df2,df2$State,function(x) merge(x,data.frame(Date=unique(do.call(c,lapply(x$Date[x$Event==1],seq,by=1,len=E))),Active=T),all.x=T))),Active=replace(Active,is.na(Active),F));
## Date State Event Active
## CA.1 2015-10-27 CA 1 TRUE
## CA.2 2015-10-28 CA 0 TRUE
## CA.3 2015-10-29 CA 0 FALSE
## CA.4 2015-10-30 CA 0 FALSE
## CA.5 2015-10-31 CA 1 TRUE
## TX.1 2015-10-27 TX 0 FALSE
## TX.2 2015-10-28 TX 1 TRUE
## TX.3 2015-10-29 TX 1 TRUE
## TX.4 2015-10-30 TX 0 TRUE
## TX.5 2015-10-31 TX 0 FALSE
## TX.6 2015-11-01 TX 0 FALSE
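For readers who find the one-liner hard to follow, the date-based idea can also be spelled out step by step. The sketch below swaps merge() for a plain %in% lookup, so it is a variation on the approach rather than the same code; the helper name mark_active and the small unordered, gapped data frame df3 are made up for illustration.

```r
# Deliberately unordered, and TX is missing Oct 28
df3 <- data.frame(
  State = c("CA", "CA", "CA", "TX", "TX"),
  Date  = as.Date(c("2015-10-30", "2015-10-27", "2015-10-28",
                    "2015-10-29", "2015-10-27")),
  Event = c(0, 1, 0, 0, 1)
)

mark_active <- function(x, E) {
  starts  <- x$Date[x$Event == 1]
  # Each start date plus offsets of 0..(E-1) days
  covered <- rep(starts, each = E) + rep(0:(E - 1), times = length(starts))
  # A row is active if its date is covered by any event
  x$Active <- x$Date %in% covered
  x
}

res3 <- do.call(rbind, lapply(split(df3, df3$State), mark_active, E = 3))
res3
```

Since rows are matched by date value rather than by position, neither row order nor missing days affects the result.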
Upvotes: 2
Reputation: 31181
Assuming your table is sorted by date within each State and you don't need to handle gaps between days (consecutive rows are treated as consecutive days), you can try:
library(data.table)
setDT(df)[, Active:=Event|c(0, head(Event,-1))|c(0,0,head(Event,-2)), State][
, Active:=Active+0]
# State Date Event Active
# 1: CA Oct27 1 1
# 2: CA Oct28 0 1
# 3: CA Oct29 0 1
# 4: CA Oct30 0 0
# 5: CA Oct31 1 1
# 6: TX Oct27 0 0
# 7: TX Oct28 1 1
# 8: TX Oct29 1 1
# 9: TX Oct30 0 1
#10: TX Oct31 0 1
#11: TX Nov1 0 0
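The lags above are hard-coded for a three-day event. If the duration needs to vary, one possible generalization is a running count of event starts over the last E rows. This is a base R sketch using cumsum() and ave(), not data.table, and it rests on the same sorted, no-gaps assumption; the helper name active_window and the parameter E are introduced here for illustration.

```r
df <- data.frame(
  State = c("CA","CA","CA","CA","CA","TX","TX","TX","TX","TX","TX"),
  Date  = c("Oct27","Oct28","Oct29","Oct30","Oct31",
            "Oct27","Oct28","Oct29","Oct30","Oct31","Nov1"),
  Event = c(1,0,0,0,1,0,1,1,0,0,0)
)

active_window <- function(e, E) {
  cs <- cumsum(e)
  # Cumulative count as of E rows earlier (0 for the first E rows)
  lagged <- c(rep(0, min(E, length(e))), head(cs, -E))
  # cs - lagged is the number of event starts in the last E rows
  as.integer(cs - lagged > 0)
}

E <- 3
# Apply the window separately within each State
df$Active <- ave(df$Event, df$State, FUN = function(e) active_window(e, E))
df
```

Setting E <- 2 and rerunning the ave() line gives two-day events instead.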
Upvotes: 3