Reputation: 4065

How do I Difference Panel Data in R

I am wondering if there is any easy R commands or packages that will all allow me to easily add variables to data.frames which are the "difference" or change of over time of those variables.

If my data looked like this:

set.seed(1)
MyData <- data.frame(Day=0:9 %% 5+1, 
                 Price=rpois(10,10),
                 Good=rep(c("apples","oranges"), each=5))
MyData

   Day Price    Good
1    1     8  apples
2    2    10  apples
3    3     7  apples
4    4    11  apples
5    5    14  apples
6    1    12 oranges
7    2    11 oranges
8    3     9 oranges
9    4    14 oranges
10   5    11 oranges

Then after "first differencing" the price variable, my data would look like this.

   Day Price    Good P1d
1    1     8  apples  NA
2    2    10  apples   2
3    3     7  apples  -3
4    4    11  apples   4
5    5    14  apples   3
6    1    12 oranges  NA
7    2    11 oranges  -1
8    3     9 oranges  -2
9    4    14 oranges   5
10   5    11 oranges  -3

Upvotes: 6

Answers (7)

sondalex

Reputation: 45

If the data is unbalanced, examples such as askesis_rea's answer and G. Grothendieck's answer do not apply directly (Note: I did not test the other answers). But they apply after expanding the dataframe to a balanced panel with NA values.

Example

In this example, individuals are not observed across all time periods.

Let's extend the example to the unbalanced case: (I remove day 2 for apples)

set.seed(1)
mydata <- data.frame(Day=0:9 %% 5+1, 
                 Price=rpois(10,10),
                 Good=rep(c("apples","oranges"), each=5))
mydata <- mydata[!(mydata$Good=="apples" & mydata$Day==2), ]# removing apples in day 2
mydata

   Day Price    Good
1    1     8  apples
3    3     7  apples
4    4    11  apples
5    5    14  apples
6    1    12 oranges
7    2    11 oranges
8    3     9 oranges
9    4    14 oranges
10   5    11 oranges

Running G. Grothendieck's dplyr answer yields the wrong values:

mydata %>% 
  group_by(Good) %>% 
  mutate(P1d = Price - lag(Price)) %>% 
  ungroup

Indeed, for Day 3 and apples, the value should be 2 however it is -1. This is because the difference between prices in Day 3 and 1 was computed rather than the difference between Day 3 and 2.

# A tibble: 9 × 4
    Day Price Good      P1d
  <dbl> <int> <chr>   <int>
1     1     8 apples     NA
2     3     7 apples     -1
3     4    11 apples      4
4     5    14 apples      3
5     1    12 oranges    NA
6     2    11 oranges    -1
7     3     9 oranges    -2
8     4    14 oranges     5
9     5    11 oranges    -3

But if we expand first, and then apply first differencing, we get the right results:

library(tidyr)
expanded <- mydata %>%  complete(nesting(Good), Day=full_seq(Day, 1))

expanded %>% 
  group_by(Good) %>% 
  mutate(P1d = Price - lag(Price)) %>% 
  ungroup

# A tibble: 10 × 4
   Good      Day Price   P1d
   <chr>   <dbl> <int> <int>
 1 apples      1     8    NA
 2 apples      2    NA    NA
 3 apples      3     7    NA
 4 apples      4    11     4
 5 apples      5    14     3
 6 oranges     1    12    NA
 7 oranges     2    11    -1
 8 oranges     3     9    -2
 9 oranges     4    14     5
10 oranges     5    11    -3

Edit

In some other scenario of varying composition of individuals across time this method might not be appropriate. One good tool is fixest::d operator. It can be used in fixest model formula or with data.table.

Example with data.table

Unbalanced

library(data.table)
# creating the data
set.seed(1)
mydata <- data.frame(Day=0:9 %% 5+1, 
                 Price=rpois(10,10),
                 Good=rep(c("apples","oranges"), each=5))
mydata <- mydata[!(mydata$Good=="apples" & mydata$Day==2), ]# removing apples in day 2

mydata <- fixest::panel(as.data.table(mydata), panel.id=~Good + Day)
mydata[, P1D:=fixest::d(Price)] # Adding inplace first difference 

as.data.frame(fixest::unpanel(mydata)) # viewing

  Day Price    Good P1D
1   1     8  apples  NA
2   3     7  apples  NA
3   4    11  apples   4
4   5    14  apples   3
5   1    12 oranges  NA
6   2    11 oranges  -1
7   3     9 oranges  -2
8   4    14 oranges   5
9   5    11 oranges  -3

Balanced

set.seed(1)
MyData <- data.frame(Day=0:9 %% 5+1, 
                 Price=rpois(10,10),
                 Good=rep(c("apples","oranges"), each=5))

MyData <- fixest::panel(as.data.table(MyData), panel.id=~Good + Day)
MyData[, P1D:=fixest::d(Price)]
as.data.frame(fixest::unpanel(MyData))

[1] TRUE
   Day Price    Good P1D
1    1     8  apples  NA
2    2    10  apples   2
3    3     7  apples  -3
4    4    11  apples   4
5    5    14  apples   3
6    1    12 oranges  NA
7    2    11 oranges  -1
8    3     9 oranges  -2
9    4    14 oranges   5
10   5    11 oranges  -3

Upvotes: 0

Sebastian

Reputation: 1369

collapse::fdiff is the function you are looking for:

library(collapse)
# This means compute difference of Price lagged once, iterated once, by Good, ordered by Day
settransform(MyData, P1d = fdiff(Price, 1, 1, Good, Day))

Upvotes: 0

askesis_rea

Reputation: 21

In my case, I had to generate first differences for a panel. In order to have the differenced vector with the same length I used diff with a NA.

library(dplyr)
mydata %>%
group_by(id) %>%
mutate(price_diff = c(NA, diff(price)))%>%
ungroup()

Upvotes: 1

G. Grothendieck

Reputation: 270448

ave

transform(MyData, P1d = ave(Price, Good, FUN = function(x) c(NA, diff(x))))

ave/gsubfn

The last solution can be shorteneed slightly using fn$ in the gsubfn package:

library(gsubfn)
transform(MyData, P1d = fn$ave(Price, Good, FUN = ~ c(NA, diff(x))))

dplyr

library(dplyr)

MyData %>% 
  group_by(Good) %>% 
  mutate(P1d = Price - lag(Price)) %>% 
  ungroup

data.table

library(data.table)

dt <- data.table(MyData)
dt[, P1d := c(NA, diff(Price)), by = Good]

Update

dplyr now uses %>% instead of %.% .

Upvotes: 12

csqr

Reputation: 1

I came up with this code following some dpylr tutorials online: My goal was to add new column that is 5 year growth rate of R & D (variable xrd) of firms (identifer GVKEY). RandD2015 is raw data file.

By using the piping function (%>%) you can combine multiple calls in dplyr. Hope this is useful (my 1st code contribution in Stack Overflow)

library(dplyr)

RandDtest<- RandDec2015 %>% 
    group_by(GVKEY) %>%
    mutate(xrd5yr=xrd/lag(xrd,4)-1)

Upvotes: 0

Miha Trošt

Reputation: 2022

One can easily do it like this:

library(reshape2)
library(dplyr)

MyNewData <- 
 MyData %.%
 melt(id = c("Good", "Day")) %.%
 dcast(Day ~ Good) %.%
 mutate(apples  = apples - lag(apples),
     oranges = oranges - lag(oranges)) %.%
 melt(id = "Day", variable.name = "Good", value.name = "P1d") %.%
 merge(MyData) %.%
 arrange(Good, Day)

Regards

Upvotes: 2

Francis Smart

Reputation: 4065

This is what I came up with. But it does not seem efficient in the least:

MyData$P1d <- c(NA, MyData$Price[-1]-MyData$Price[-nrow(MyData)])
MyData$P1d[c(F,MyData$Good[-1]!=MyData$Good[-nrow(MyData)])] <- NA

MyData

   Day Price    Good P1d
1    1     8  apples  NA
2    2    10  apples   2
3    3     7  apples  -3
4    4    11  apples   4
5    5    14  apples   3
6    1    12 oranges  NA
7    2    11 oranges  -1
8    3     9 oranges  -2
9    4    14 oranges   5
10   5    11 oranges  -3

Upvotes: 0

How do I Difference Panel Data in R

Answers (7)

Example

Update

Related Questions