Reputation: 763
My starting point is something like the data frame df:
df<-data.frame(id=c(rep(2, 3), rep(4, 2)), year=c(2005:2007, 2005:2006), event=c(1,0,0,0,1))
  id year event
1  2 2005     1
2  2 2006     0
3  2 2007     0
4  4 2005     0
5  4 2006     1
I have a series of actors (identified through an id) who happen to experience an event in a certain year.
What I am trying to build is a series of additional columns that describe a) the distance from events and b) whether such a distance is observable.
This is what I would like to obtain.
  id year event evm2 evm1 evp1 evp2 ndm2 ndm1 ndp1 ndp2
1  2 2005     1    0    0    0    0    1    1    0    0
2  2 2006     0    0    1    0    0    1    0    0    1
3  2 2007     0    1    0    0    0    0    0    1    1
4  4 2005     0    0    0    1    0    1    1    0    1
5  4 2006     1    0    0    0    0    1    0    1    1
event equals 1 when there is an event in a certain year. evm1 equals 1 when an event is observed in the year before. Similarly, evp1 is 1 when the event is in the following year. The letters p and m stand for 'plus' and 'minus', and the numbers represent the distance in years from the event.
For some of these observations the distance is not observable because the available time window is too short. This is the case for df[1,], for which we don't know whether an event took place in the previous years. In such a case, ndm1 and ndm2 are coded 1. For df[5,], it is ndp1 (and ndp2) that are coded 1.
The ev and nd variables work in exactly the same way, but the former tell whether there is an event at a certain distance, while the latter reveal whether that distance is actually observable.
I tried to accomplish this using the following nested for loops, but I didn't succeed.
lag<-c(-2, -1, 1, 2)
df2<-df
df2[,4:11]<-0
colnames(df2)<-c("id", "year", "event", "evm2", "evm1", "evp1", "evp2", "ndm2", "ndm1", "ndp1", "ndp2")
for (i in length(df2$id)) {
  id<-df2[i,1]
  yr<-df2[i,2]
  sta<-3
  sta2<-7
  for (j in lag){
    sta<-sta+1
    sta2<-sta2+1
    if !is.null(df2[df2$id==id & df2$year==yr+j])==TRUE {
      rw<-which(df2[df2$id==id & df2$year==yr+j])
      if (df2[rw,3]==1) df2[i, sta]==1
    } else {
      df2[i, sta2]==1
    }
  }
}
Do you see anything that may be responsible for the errors? I have been going mad for two days trying to make it work and I would be really thankful if you could help.
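For reference, a few things in the loop above keep it from working: for (i in length(df2$id)) visits only the last row, the if condition is not wrapped in parentheses, the subsets df2[...] lack the comma that selects rows, which() needs a logical vector rather than a data frame, and == compares where <- should assign. The following is a minimal corrected sketch, not a definitive fix. It assumes at most one row per id and year, and the names ev_cols and nd_cols are introduced here purely for illustration.

```r
df <- data.frame(id = c(rep(2, 3), rep(4, 2)),
                 year = c(2005:2007, 2005:2006),
                 event = c(1, 0, 0, 0, 1))

lag <- c(-2, -1, 1, 2)
ev_cols <- c("evm2", "evm1", "evp1", "evp2")  # illustrative names, one per lag
nd_cols <- c("ndm2", "ndm1", "ndp1", "ndp2")

df2 <- df
df2[c(ev_cols, nd_cols)] <- 0  # start with every indicator at 0

for (i in seq_len(nrow(df2))) {   # iterate over all rows, not just the last
  for (k in seq_along(lag)) {
    # row(s) of the same actor at the shifted year
    rw <- which(df2$id == df2$id[i] & df2$year == df2$year[i] + lag[k])
    if (length(rw) == 0) {
      df2[i, nd_cols[k]] <- 1     # shifted year lies outside the observed window
    } else if (any(df2$event[rw] == 1)) {
      df2[i, ev_cols[k]] <- 1     # an event occurred at that distance
    }
  }
}

print(df2)
```

Looking rows up by year rather than by position means gaps in an actor's years are treated as unobservable instead of being silently skipped.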
Upvotes: 3
Views: 172
Reputation: 89057
Following my comment, here is what I had in mind as a potential rewrite:
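lag.it <- function(x, n = 0L) {
  # Shift x by n positions, padding with NA so the length is unchanged:
  # n < 0 takes the value from |n| positions earlier, n > 0 from n positions later.
  if (n == 0L) return(x)  # no shift; head(x, 0) and tail(x, 0) would both be empty
  l <- length(x)
  neg.lag <- min(max(0L, -n), l)
  pos.lag <- min(max(0L, +n), l)
  c(rep(NA, +neg.lag),
    head(x, -neg.lag),
    tail(x, -pos.lag),
    rep(NA, +pos.lag))
}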
library(plyr)
ddply(df, "id", transform,
      evm2 = lag.it(event, -2),
      evm1 = lag.it(event, -1),
      evp1 = lag.it(event, +1),
      evp2 = lag.it(event, +2))
#   id year event evm2 evm1 evp1 evp2
# 1  2 2005     1   NA   NA    0    0
# 2  2 2006     0   NA    1    0   NA
# 3  2 2007     0    1    0   NA   NA
# 4  4 2005     0   NA   NA    1   NA
# 5  4 2006     1   NA    0   NA   NA
Notice how I use NAs instead of two sets of variables. While I'd recommend you keep it this way, you can easily get what you asked for by defining e.g. ndm2 as is.na(evm2) and then replacing the NAs with zeroes.
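As a rough sketch of that last step, the snippet below builds both sets of columns from the NA-coded lags. To stay self-contained it uses base R's ave() for the per-id grouping in place of ddply, and the offsets vector and the res data frame are names made up for this illustration. Like lag.it itself, it assumes the rows are sorted by year within each id and that the years are consecutive.

```r
df <- data.frame(id = c(rep(2, 3), rep(4, 2)),
                 year = c(2005:2007, 2005:2006),
                 event = c(1, 0, 0, 0, 1))

# Same helper as in the answer: shift x by n positions, padding with NA
lag.it <- function(x, n = 0L) {
  l <- length(x)
  neg.lag <- min(max(0L, -n), l)
  pos.lag <- min(max(0L, +n), l)
  c(rep(NA, +neg.lag),
    head(x, -neg.lag),
    tail(x, -pos.lag),
    rep(NA, +pos.lag))
}

# Column suffix -> shift (illustrative naming, mirroring the question)
offsets <- c(m2 = -2, m1 = -1, p1 = 1, p2 = 2)

res <- df
for (s in names(offsets)) {
  # NA-coded lag, computed separately within each id
  shifted <- ave(df$event, df$id, FUN = function(x) lag.it(x, offsets[[s]]))
  res[[paste0("nd", s)]] <- as.integer(is.na(shifted))     # 1 = distance not observable
  res[[paste0("ev", s)]] <- ifelse(is.na(shifted), 0, shifted)  # NA replaced by 0
}

# Reorder columns to match the layout in the question
res <- res[c("id", "year", "event",
             paste0("ev", names(offsets)),
             paste0("nd", names(offsets)))]
print(res)
```

Keeping the NA-coded columns, as recommended above, avoids carrying two columns per offset; the split into ev and nd variables is only needed if downstream code expects that layout.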
Upvotes: 3