Reputation: 846
I have a dataset akin to this:
User Date Value
A 2012-01-01 4
A 2012-01-02 5
A 2012-01-03 6
A 2012-01-04 7
B 2012-01-01 2
B 2012-01-02 3
B 2012-01-03 4
B 2012-01-04 5
I want to create a lag of Value, respecting User:
User Date Value Value.lag
A 2012-01-01 4 NA
A 2012-01-02 5 4
A 2012-01-03 6 5
A 2012-01-04 7 6
B 2012-01-01 2 NA
B 2012-01-02 3 2
B 2012-01-03 4 3
B 2012-01-04 5 4
I've done it, very inefficiently, in a loop:
df$value.lag1 <- NA
levs <- levels(as.factor(df$User))
for (i in 1:length(levs)) {
  temper <- subset(df, User == levs[i])
  df$value.lag1[df$User == levs[i]] <- c(NA, temper$Value[-nrow(temper)])
}
But this is very slow. I've looked at using by and tapply, but haven't figured out how to make them work.
I don't think xts or ts will work because of the User element.
Any suggestions?
Upvotes: 13
Views: 18640
Reputation: 1369
The collapse package, now available on CRAN, provides the most general C/C++-based solution to (fully identified) panel lags, leads, differences and growth rates / log differences. It has the generic functions flag, fdiff and fgrowth, and the associated lag/lead, difference and growth operators L, F, D and G. So to lag a panel dataset, it is sufficient to type:
L(data, n = 1, by = ~ idvar, t = ~ timevar, cols = 4:8)
which means: compute 1 lag of columns 4 through 8 of data, identified by idvar and timevar. Multiple ID and time variables can be supplied, i.e. ~ id1 + id2, and sequences of lags and leads can also be computed on each column (i.e. n = -1:3 computes one lead and three lags). The same thing can also be done more programmatically with flag:
flag(data[4:8], 1, data$idvar, data$timevar)
Both of these options compute in under a millisecond on typical datasets (<30,000 obs.). Large-data performance is similar to data.table's shift. Analogous syntax applies to differences (fdiff / D) and growth rates (fgrowth / G). These functions are all S3 generic, with vector / time-series, matrix / ts-matrix and data.frame methods, as well as plm::pseries, plm::pdata.frame and grouped_df methods, so they can be used together with plm classes for panel data and with dplyr.
Upvotes: 0
Reputation: 81
I think the easiest way, especially with further analysis in mind, is to convert your data frame to the pdata.frame class from the plm package. After the conversion, the diff() and lag() operators can be used to create panel differences and lags.
df <- pdata.frame(df, index = c("id", "date"))
df <- transform(df, l_value = lag(value, 1))
Upvotes: 8
Reputation: 111
I stumbled over a similar problem and wrote a function.
# df needs to be a balanced panel data set, sorted by id and date.
# Note: the function drops the rows where the NA values would have been.
df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 date = c(1992, 1993, 1991, 1990, 1994,
                          1992, 1991, 1994, 1990, 1993),
                 value = c(4.1, 4.5, 3.3, 5.3, 3.0,
                           3.2, 5.2, 5.3, 3.4, 5.6))
# sort the panel data set
library(dplyr)
df <- arrange(df, id, date)
# Function arguments:
#   a = df
#   b = colname(s) of the variable(s) that you want to lag
#   q = number of lag years
#   t = colname of the date/time column
retraso <- function(a, b, q, t) {
  sto <- max(as.numeric(unique(a[[t]])))
  sta <- min(as.numeric(unique(a[[t]])))
  # keep only the rows for which a q-year lag exists
  yo <- a[which(a[[t]] >= (sta + q)), ]
  la <- function(a, d, t, sto, sta) {
    ja <- data.frame(a[[d]], a[[t]])
    colnames(ja) <- c(d, t)
    # the first (sto - q) years supply the lagged values
    ja <- ja[which(ja[[t]] <= (sto - q)), 1]
    return(ja)
  }
  for (i in 1:length(b)) {
    yo[[b[i]]] <- la(a, b[i], t, sto, sta)
  }
  return(yo)
}
# lag df 1 year
df <- retraso(df, "value", 1, "date")
Upvotes: 2
Reputation: 133
For a panel without missing obs this is an intuitive solution:
df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2),
date = c(1992, 1993, 1991, 1990, 1994, 1992, 1991),
value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2))
df <- df[with(df, order(id, date)), ]  # sort by id, then by date
df$l_value <- c(NA, df$value[-length(df$value)])  # new var with values displaced by 1 row
df$l_value[df$id != c(NA, df$id[-length(df$id)])] <- NA  # set to NA where the current and lagged id differ
df
id date value l_value
4 1 1990 5.3 NA
3 1 1991 3.3 5.3
1 1 1992 4.1 3.3
2 1 1993 4.5 4.1
5 1 1994 3.0 4.5
7 2 1991 5.2 NA
6 2 1992 3.2 5.2
Upvotes: 2
Reputation: 2677
If you don't have gaps in the time variable, do
df %>% group_by(User) %>% mutate(Value_lag = lag(Value, order_by = Date))
If you do have gaps in the time variable, see this answer: https://stackoverflow.com/a/26108191/3662288
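Spelled out on the question's data (a sketch; column names as in the question, dplyr assumed installed):

```r
library(dplyr)

# the question's toy panel
df <- data.frame(User = rep(c("A", "B"), each = 4),
                 Date = rep(seq(as.Date("2012-01-01"), by = "day", length.out = 4), 2),
                 Value = c(4:7, 2:5))

# lag Value by one period within each User, ordered by Date
df <- df %>%
  group_by(User) %>%
  mutate(Value_lag = lag(Value, order_by = Date)) %>%
  ungroup()
```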
Upvotes: 1
Reputation: 16277
Provided the table is ordered by User and Date, this can be done with zoo. The trick is not to specify an index at this point.
library(zoo)
df <-read.table(text="User Date Value
A 2012-01-01 4
A 2012-01-02 5
A 2012-01-03 6
A 2012-01-04 7
B 2012-01-01 2
B 2012-01-02 3
B 2012-01-03 4
B 2012-01-04 5", header=TRUE, as.is=TRUE,sep = " ")
out <- zoo(df)
Value.lag <- lag(out, -1)[out$User == lag(out$User)]
res <- merge.zoo(out, Value.lag)
res <- res[, -(4:5)]  # remove the extra columns
User.out Date.out Value.out Value.Value.lag
1 A 2012-01-01 4 <NA>
2 A 2012-01-02 5 4
3 A 2012-01-03 6 5
4 A 2012-01-04 7 6
5 B 2012-01-01 2 <NA>
6 B 2012-01-02 3 2
7 B 2012-01-03 4 3
8 B 2012-01-04 5 4
Upvotes: 0
Reputation: 18323
Similarly, you could use tapply:
# Create data
user <- c(rep('A', 4), rep('B', 4))
date <- rep(seq(as.Date('2012-01-01'), as.Date('2012-01-04'), 1), 2)
value <- c(4:7, 2:5)
df <- data.frame(user, date, value)
# Get lagged values
df$value.lag <- unlist(tapply(df$value, df$user, function(x) c(NA, x[-length(x)])))
The idea is exactly the same: take value, split it by user, and then run a function on each subset. The unlist brings it back into vector format.
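A closely related base-R alternative (my addition, not part of the answer above) is ave, which applies the function by group and returns the result already aligned with the original row order, so no unlist is needed:

```r
# the question's data, without the Date column (rows sorted by user)
user  <- c(rep("A", 4), rep("B", 4))
value <- c(4:7, 2:5)
df <- data.frame(user, value)

# shift each user's values down by one position, padding the first with NA
df$value.lag <- ave(df$value, df$user, FUN = function(x) c(NA, x[-length(x)]))
```

Unlike the tapply version, ave also works when the rows of one group are interleaved with another, since it writes each result back to the positions it came from.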
Upvotes: 0
Reputation: 32351
You can use ddply: it cuts a data.frame into pieces and transforms each piece.
d <- data.frame(
  User = rep(LETTERS[1:3], each = 10),
  Date = seq.Date(Sys.Date(), length = 30, by = "day"),
  Value = rep(1:10, 3)
)
library(plyr)
d <- ddply(
  d, .(User), transform,
  # This assumes that the data is sorted
  Value = c(NA, Value[-length(Value)])
)
Upvotes: 9