Conditional column creation (horizontal and vertical conditions)

Question

My starting point is a data frame like df:

df<-data.frame(id=c(rep(2, 3), rep(4, 2)), year=c(2005:2007, 2005:2006), event=c(1,0,0,0,1))

  id year event
1  2 2005     1
2  2 2006     0
3  2 2007     0
4  4 2005     0
5  4 2006     1

I have a series of actors (identified through an id) who happen to experience an event in a certain year.

What I am trying to build is a series of additional columns that describe a) the distance from events and b) whether such a distance is observable.

This is what I would like to obtain.

   id year event evm2 evm1 evp1 evp2 ndm2 ndm1 ndp1 ndp2
1  2 2005     1    0    0    0    0    1    1    0    0
2  2 2006     0    0    1    0    0    1    0    0    1
3  2 2007     0    1    0    0    0    0    0    1    1
4  4 2005     0    0    0    1    0    1    1    0    1
5  4 2006     1    0    0    0    0    1    0    1    1

event equals 1 when there is an event in a given year. evm1 equals 1 when an event occurred in the year before; similarly, evp1 equals 1 when an event occurs in the following year. The letters m and p stand for 'minus' and 'plus', and the numbers give the distance in years from the event. For some observations the distance is not observable because the available time window is too short. This is the case for df[1,], for which we don't know whether an event took place in the previous years, so ndm1 and ndm2 are coded 1. For df[5,], it is ndp1 and ndp2 that are coded 1. The ev and nd variables are built in the same way: the former tell whether there is an event at a given distance, and the latter flag whether that distance is actually observable (1 meaning it is not).

I tried to accomplish this using the following nested for loops, but I didn't succeed.

lag<-c(-2, -1, 1, 2)
df2<-df
df2[,4:11]<-0
colnames(df2)<-c("id", "year", "event", "evm2",  "evm1",  "evp1",  "evp2",  "ndm2",  "ndm1",  "ndp1",  "ndp2") 


for (i in length(df2$id)) {

  id<-df2[i,1]
  yr<-df2[i,2]
  sta<-3
  sta2<-7

  for (j in lag){

    sta<-sta+1
    sta2<-sta2+1

    if !is.null(df2[df2$id==id & df2$year==yr+j])==TRUE {

      rw<-which(df2[df2$id==id & df2$year==yr+j])

      if (df2[rw,3]==1) df2[i, sta]==1

    } else {

      df2[i, sta2]==1

    }

  }

}

Do you see anything that may be responsible for the errors? I have been going mad for two days trying to make it work and I would be really thankful if you could help.

flodel · Accepted Answer

Following my comment, here is what I had in mind as a potential rewrite:

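# Shift x by n positions, padding with NA:
# n < 0 looks back (lag), n > 0 looks ahead (lead).
lag.it <- function(x, n = 0L) {
  if (n == 0L) return(x)  # no shift; head(x, 0) below would drop every element
  l <- length(x)
  neg.lag <- min(max(0L, -n), l)
  pos.lag <- min(max(0L, +n), l)
  c(rep(NA, +neg.lag),
    head(x, -neg.lag),  # n < 0: x without its last neg.lag values (empty when n > 0)
    tail(x, -pos.lag),  # n > 0: x without its first pos.lag values (empty when n < 0)
    rep(NA, +pos.lag))
}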

library(plyr)
ddply(df, "id", transform,
      evm2 = lag.it(event, -2),
      evm1 = lag.it(event, -1),
      evp1 = lag.it(event, +1),
      evp2 = lag.it(event, +2))

#   id year event evm2 evm1 evp1 evp2
# 1  2 2005     1   NA   NA    0    0
# 2  2 2006     0   NA    1    0   NA
# 3  2 2007     0    1    0   NA   NA
# 4  4 2005     0   NA   NA    1   NA
# 5  4 2006     1   NA    0   NA   NA

Notice how I use NAs instead of two sets of variables. While I'd recommend you keep it this way, you can easily get what you asked for by defining e.g. ndm2 as is.na(evm2) and then replacing the NAs with zeroes.
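For completeness, here is a minimal sketch of that last step, written in base R so it runs without plyr. It assumes, as the example data does, that rows are sorted by year within each id and that there are no gaps between years. The lags lookup vector and the res data frame are names made up for this illustration, and ave() is swapped in for ddply() to apply the shift within each id.

```r
# Same example data as in the question
df <- data.frame(id = c(rep(2, 3), rep(4, 2)),
                 year = c(2005:2007, 2005:2006),
                 event = c(1, 0, 0, 0, 1))

# Same shifting logic as lag.it above, repeated so the snippet runs on its own
lag.it <- function(x, n = 0L) {
  l <- length(x)
  neg.lag <- min(max(0L, -n), l)
  pos.lag <- min(max(0L, +n), l)
  c(rep(NA, +neg.lag),
    head(x, -neg.lag),
    tail(x, -pos.lag),
    rep(NA, +pos.lag))
}

# Hypothetical lookup: column suffix -> shift in years
lags <- c(m2 = -2, m1 = -1, p1 = 1, p2 = 2)

res <- df

# Step 1: shifted event columns, computed within each id;
# NA marks a distance that falls outside the observed window
for (s in names(lags)) {
  res[[paste0("ev", s)]] <- ave(df$event, df$id,
                                FUN = function(x) lag.it(x, lags[[s]]))
}

# Step 2: nd* flags the NAs, then the NAs in ev* are replaced by zeroes
for (s in names(lags)) {
  ev <- paste0("ev", s)
  res[[paste0("nd", s)]] <- as.integer(is.na(res[[ev]]))
  res[[ev]][is.na(res[[ev]])] <- 0
}

res
```

On the example data this should give the same eleven columns as the table in the question. If your real data has missing years within an id, a positional shift like this one will silently pair the wrong years, so you would need to match on year explicitly instead.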
