Alex

Reputation: 2780

Multiple grouped differences in dplyr

Goal

Currently I only report means for the calculations shown below, but I would like to add confidence intervals.

If I had the data in the correct format, it would not be hard for me to use linear regression (lm()) to calculate the estimated grouped differences and their intervals, but I am having difficulty getting the data into that format.

Here is some data:

fake data

> library(dplyr)
> set.seed(909)
> d2017pre <- tibble(n = rnorm(25, mean = 1100, sd = 10),period = "pre", year = 2017)
> d2016pre <- tibble(n = rnorm(25, mean = 1500, sd = 10),period = "pre", year = 2016)
> d2017post <- tibble(n = rnorm(25, mean = 1000, sd = 10),period = "post", year = 2017)
> d2016post <- tibble(n = rnorm(25, mean = 900, sd = 10),period = "post", year = 2016)
> df <- bind_rows(d2017pre,d2016pre,d2017post,d2016post)


> df %>% group_by(year,period) %>% summarise(mean(n))
# A tibble: 4 x 3
# Groups: year [?]
   year period `mean(n)`
  <dbl> <chr>      <dbl>
1  2016 post         899
2  2016 pre         1498
3  2017 post         999
4  2017 pre         1104

Background

These are the three calculations I routinely do.

> # pre - post 2016
> pp16 <- 1498 - 899
> pp16
[1] 599
> 
> # pre - post 2017
> pp17 <- 1104 - 999
> pp17
[1] 105
> 
> # net of control: pp2016 - pp2017 
> noc <- pp16 - pp17
> noc
[1] 494

The questions these calculations answer are:

  1. What was the difference between the pre and post periods in 2016 and in 2017?

  2. Was 2017's pre/post difference greater than 2016's pre/post difference?

I would like to answer these questions not just with estimates but also with confidence intervals. As mentioned above, I am planning on using lm() to get the confidence intervals of the differences, but I am having difficulty getting the data into the correct format.

I believe this will require two data sets: one for the difference between the periods within each year, and one for the difference of the differences (net of control). This leads to the following questions.

Questions

  1. How can I calculate the differences of n grouped by period and year?

  2. How can I calculate the differences of differences?

Upvotes: 0

Views: 45

Answers (1)

erocoar

Reputation: 5893

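First, you can get the differences using a second group_by. Because the periods sort alphabetically (post, then pre), diff() returns pre minus post within each year.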

diffs <- df %>% 
  group_by(year, period) %>% 
  summarise(mean = mean(n)) %>%
  group_by(year) %>%
  summarise(diff = diff(mean))

# A tibble: 2 x 2
   year  diff
  <dbl> <dbl>
1  2016   599
2  2017   105

The difference of the differences is then computed in a similar way (the reuse of the name diff is maybe a bit confusing):

diff(rev(diffs$diff))

[1] 493.8846

For the regression, you actually do not need to alter your data frame: lm() needs the raw observations to estimate the effects and their uncertainty. I think (though I am not sure I understand correctly) you are looking for a model with an interaction effect?

E.g.,

# period * factor(year) expands to both main effects plus their interaction
m1 <- lm(n ~ period * factor(year), data = df)
summary(m1)

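Note how the interaction effect is basically that difference of differences. With the default reference levels (post and 2016) it comes out with the opposite sign, about -494, because it measures the 2017 pre/post difference minus the 2016 one.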
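To get the confidence intervals the question asks for, something like the following might work. It is only a sketch: it rebuilds the fake data with base R (same seed and draw order as in the question), converts year to a factor up front, and uses confint() on the fitted model. The names ci_2016, ci_2017 and year17 are made up for illustration. The periodpre row is the pre minus post difference for the reference year, so refitting with 2017 as the reference is one way to get the 2017 difference with its own interval.

```r
# Rebuild the question's fake data with base R (same seed and draw order).
set.seed(909)
df <- rbind(
  data.frame(n = rnorm(25, mean = 1100, sd = 10), period = "pre",  year = 2017),
  data.frame(n = rnorm(25, mean = 1500, sd = 10), period = "pre",  year = 2016),
  data.frame(n = rnorm(25, mean = 1000, sd = 10), period = "post", year = 2017),
  data.frame(n = rnorm(25, mean = 900,  sd = 10), period = "post", year = 2016)
)
df$year <- factor(df$year)  # levels "2016", "2017"; 2016 is the reference

# Interaction model: main effects of period and year plus their interaction.
m1 <- lm(n ~ period * year, data = df)
ci_2016 <- cbind(estimate = coef(m1), confint(m1, level = 0.95))

# Same model with 2017 as the reference year (year17 is a hypothetical name),
# so that "periodpre" becomes the pre - post difference for 2017.
df$year17 <- relevel(df$year, ref = "2017")
m2 <- lm(n ~ period * year17, data = df)
ci_2017 <- cbind(estimate = coef(m2), confint(m2, level = 0.95))

# pre - post in 2016, and the difference of differences (2017 minus 2016)
print(ci_2016[c("periodpre", "periodpre:year2017"), ])
# pre - post in 2017
print(ci_2017["periodpre", ])
```

Both fits are the same model under a different parameterisation, so the residual variance is shared. Note that these intervals assume equal variance across the four groups, which holds for the fake data but may not for real data.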

Upvotes: 1
