Daniel Egan
Daniel Egan

Reputation: 846

general lag in time series panel data

I have a dataset akin to this

User    Date        Value
A       2012-01-01  4
A       2012-01-02  5   
A       2012-01-03  6
A       2012-01-04  7
B       2012-01-01  2
B       2012-01-02  3   
B       2012-01-03  4
B       2012-01-04  5

I want to create a lag of Value, respecting User.

User    Date        Value   Value.lag
A       2012-01-01  4       NA
A       2012-01-02  5       4
A       2012-01-03  6       5
A       2012-01-04  7       6
B       2012-01-01  2       NA
B       2012-01-02  3       2   
B       2012-01-03  4       3
B       2012-01-04  5       4

I've done it very inefficiently in a loop

df$value.lag1<-NA
levs<-levels(as.factor(df$User))
levs
  for (i in 1:length(levs)) {
    temper<- subset(df,User==as.numeric(levs[i]))
    temper<- rbind(NA,temper[-nrow(temper),])  
df$value.lag1[df$User==as.numeric(as.character(levs[i]))]<- temper
      }

But this is very slow. I've looked at using by and tapply, but not figured out how to make them work.

I don't think XTS or TS will work because of the User element.

Any suggestions?

Upvotes: 13

Views: 18640

Answers (8)

Sebastian
Sebastian

Reputation: 1369

The collapse package now available on CRAN provides the most general C/C++ based solution to (fully-identified) panel-lags, leads, differences and growth rates / log differences. It has the generic functions flag, fdiff and fgrowth and associated lag / lead, difference and growth operators L, F, D and G. So to lag a panel dataset, it is sufficient to type:

L(data, n = 1, by = ~ idvar, t = ~ timevar, cols = 4:8)

which means: Compute 1 lag of columns 4 through 8 of data, identified by idvar and timevar. Multiple ID and time-variables can be supplied i.e. ~ id1 + id2, and sequences of lags and leads can also be computed on each column (i.e. n = -1:3 computes one lead and 3 lags). The same thing can also be done more programmatically with flag:

flag(data[4:8], 1, data$idvar, data$timevar)

Both of these options compute below 1 millisecond on typical datasets (<30,000 obs.). Large data performance is similar to data.tables shift. Similar programming applies to differences fdiff / D and growth rates fgrowth / G. These functions are all S3 generic and have vector / time-series, matrix / ts-matrix, data.frame, as well as plm::pseries and plm::pdata.frame and grouped_df methods. Thus they can be used together with plm classes for panel data and with dplyr.

Upvotes: 0

Andriy Levitskyy
Andriy Levitskyy

Reputation: 81

I think the easiest way, especially considering doing further analysis, is to convert your data frame to pdata.frame class from plm package.

After the conversion from diff() and lag() operators can be used to create panel differences and lags.

df<-pdata.frame(df,index=c("id","date"))  
df<-transform(df, l_value=lag(value,1))   

Upvotes: 8

Niltzable
Niltzable

Reputation: 111

I stumbled over a similar problem and wrote a function.

#df needs to be a structured balanced paneldata set sorted by id and date
#OBS the function deletes the row where the NA value would have been.

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2,2,2,2), 
                 date = c(1992, 1993, 1991, 1990, 1994, 1992, 1991
                          ,1994,1990,1993), 
                 value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2,5.3,3.4,5.6))
# sort paneldata set
library(dplyr)
df<-arrange(df,id,date)

#Function
# a=df
# b=colname of variable/variables that you want to lag
# q=number of lag years
# t=colname of date/time column
retraso<-function(a,b,q,t){

  sto<-max(as.numeric(unique(a[[t]])))
  sta<-min(as.numeric(unique(a[[t]])))

  yo<-a[which(a[[t]]>=(sta+q)),]

  la<-function(a,d,t,sto,sta){


    ja<-data.frame(a[[d]],a[[t]])
    colnames(ja)<-c(d,t)


    ja<-ja[which(ja[[t]]<=(sto-q)),1]
    return(ja)
  }

  for (i in 1:length(b)){

    yo[[b[i]]] <-la(a,b[i],t,sto,sta)

    }
    return(yo)

  }

#lag df 1 year
df<-retraso(df,"value",1,"date")

Upvotes: 2

Fix.B.
Fix.B.

Reputation: 133

For a panel without missing obs this is an intuitive solution:

df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2), 
                 date = c(1992, 1993, 1991, 1990, 1994, 1992, 1991), 
                 value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2))

df<-df[with(df, order(id,date)), ]  # sort by id and then by date
df$l_value=c(NA,df$value[-length(df$value)]) # create a new var with data displaced by 1 unit
df$l_value[df$id != c(NA, df$id[-length(df$id)])] =NA # NA data with different current and lagged id.
df

id date value l_value
4  1 1990   5.3      NA
3  1 1991   3.3     5.3
1  1 1992   4.1     3.3
2  1 1993   4.5     4.1
5  1 1994   3.0     4.5
7  2 1991   5.2      NA
6  2 1992   3.2     5.2

Upvotes: 2

Matthew
Matthew

Reputation: 2677

If you don't have gaps in the time variable , do

df %>% group_by(User) %>% mutate(value_lag = lag(value, order_by =Date)

If you do have gaps in the time variable, see this answer https://stackoverflow.com/a/26108191/3662288

Upvotes: 1

Pierre Lapointe
Pierre Lapointe

Reputation: 16277

Provided the table is ordered by User and Date, this can be done with zoo. The trick is not to specify an index at this point.

library(zoo)
df <-read.table(text="User Date Value
A 2012-01-01 4
A 2012-01-02 5
A 2012-01-03 6
A 2012-01-04 7
B 2012-01-01 2
B 2012-01-02 3
B 2012-01-03 4
B 2012-01-04 5", header=TRUE, as.is=TRUE,sep = " ")

out <-zoo(df)

Value.lag <-lag(out,-1)[out$User==lag(out$User)]
res <-merge.zoo(out,Value.lag)
res <-res[,-(4:5)]  # to remove extra columns

  User.out Date.out   Value.out Value.Value.lag
1 A        2012-01-01 4         <NA>           
2 A        2012-01-02 5         4              
3 A        2012-01-03 6         5              
4 A        2012-01-04 7         6              
5 B        2012-01-01 2         <NA>           
6 B        2012-01-02 3         2              
7 B        2012-01-03 4         3              
8 B        2012-01-04 5         4 

Upvotes: 0

nograpes
nograpes

Reputation: 18323

Similarly, you could use tapply

# Create Data
user = c(rep('A',4),rep('B',4))
date = rep(seq(as.Date('2012-01-01'),as.Date('2012-01-04'),1),2)
value = c(4:7,2:5) 
df = data.frame(user,date,value)
# Get lagged values
df$value.lag = unlist(tapply(df$value, df$user, function(x) c(NA,x[-length(df$value)])))

The idea is exactly the same: take value, split it by user, and then run a function on each subset. The unlist brings it back into vector format.

Upvotes: 0

Vincent Zoonekynd
Vincent Zoonekynd

Reputation: 32351

You can use ddply: it cuts a data.frame into pieces and transforms each piece.

d <- data.frame( 
  User = rep( LETTERS[1:3], each=10 ),
  Date = seq.Date( Sys.Date(), length=30, by="day" ),
  Value = rep(1:10, 3)
)
library(plyr)
d <- ddply( 
  d, .(User), transform,
  # This assumes that the data is sorted
  Value = c( NA, Value[-length(Value)] ) 
)

Upvotes: 9

Related Questions