Reputation: 191

Panel data: Calculate group means while omitting first period from calculation

I have an issue regarding a certain kind of mean() calculation. I use a panel data set with two indentifiers "ID" and "year" (using the plm pkg)

I want to calculate the groupwise mean of a variable "y", but omit the first year's entry of the calculation and then only fill in the calculated mean only in the years that were used to calculate it. In other words, I want to have NA in every ID's first entry of this variable.

The panel data is unbalanced, so people come and go at different points in time. Some stay from beginning till end, for others I just have data for three 3 years.

library(tidyverse)
library(plm)

ID <- c("a","a","a","a","a","b","b","b","b","c","c","c")
y <- c(9,2,5,3,3,9,1,2,3,9,2,5)
year<- c(2001,2002,2003,2004,2005,2001,2002,2003,2004,2002,2003,2004)
dt <- data.frame(ID,y,year)
dt <- pdata.frame(dt, index = c("ID","year"))

I first tried a filter over periods like so:

dt <- dt %>% group_by(ID) %>% 
  filter(year %in% first(year)+1:last(year)) %>% 
  mutate(mean.y = mean(y))

But that doesn't work, and I am not surprised to be honest but I hope you know what I want to achieve. The final result should look like this:

See how the first entry of variable y = 9 for "a-2001" is left out so that it doesnt affect the mean of individual a's other y entries (2+5+3+3)/4

i hope you people could understand it. I would massively appreciate any help. Bye

Upvotes: 3

Answers (2)

ekoam

Reputation: 8844

Here is a dplyr solution. You can calculate the mean of all values except for the first one and then use is.na<- function to assign the first element of mean.y as NA.

library(dplyr)
dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L]), mean.y = `is.na<-`(mean.y, 1L))

Output

# A tibble: 12 x 4
# Groups:   ID [3]
   ID        y  year mean.y
   <chr> <dbl> <dbl>  <dbl>
 1 a         9  2001  NA   
 2 a         2  2002   3.25
 3 a         5  2003   3.25
 4 a         3  2004   3.25
 5 a         3  2005   3.25
 6 b         9  2001  NA   
 7 b         1  2002   2   
 8 b         2  2003   2   
 9 b         3  2004   2   
10 c         9  2002  NA   
11 c         2  2003   3.5 
12 c         5  2004   3.5

More compactly,

dt %>% group_by(ID) %>% mutate(mean.y = mean(y[-1L])[n():1 %/% n() + 1L])

Upvotes: 2

TimTeaFan

Reputation: 18561

We could work with an ifelse inside mutate. Its more code, but I think its quite readable and easy to understand whats going on.

library(tidyverse)
library(plm)

dt %>% 
  group_by(ID) %>% 
  mutate(mean.y = ifelse(year == first(year),
                         NA,
                         mean(y[year != first(year)], na.rm = TRUE)))

#> # A tibble: 12 x 4
#> # Groups:   ID [3]
#>    ID        y year  mean.y
#>    <fct> <dbl> <fct>  <dbl>
#>  1 a         9 2001   NA   
#>  2 a         2 2002    3.25
#>  3 a         5 2003    3.25
#>  4 a         3 2004    3.25
#>  5 a         3 2005    3.25
#>  6 b         9 2001   NA   
#>  7 b         1 2002    2   
#>  8 b         2 2003    2   
#>  9 b         3 2004    2   
#> 10 c         9 2002   NA   
#> 11 c         2 2003    3.5 
#> 12 c         5 2004    3.5

^{Created on 2022-01-23 by the reprex package (v0.3.0)}

Upvotes: 2

Panel data: Calculate group means while omitting first period from calculation

Answers (2)

Related Questions