Elise

Reputation: 45

Calculate mean by inclusive levels - expanding window based on entire groups

I have data with some values ('value') and a grouping variable ('level'):

# Sample data frame
df <- data.frame(value = c(1,0,1,0,1,1,1,1,0,0),
                 level = c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2"))

    value level
1      1     A2
2      0     A3
3      1     A1
4      0     A2
5      1     A1
6      1     A3
7      1     A2
8      1     A1
9      0     A3
10     0     A2

Conceptually, I assume that "A1" is the lowest level and "A3" the highest. I want to calculate mean value by including more and more groups in the calculation, like an 'expanding window', but based on entire groups, not on individual rows. Thus, for each group, the calculation will include values of the focal group and values in all other lower level groups.

I want the levels "A2" and "A3" to be inclusive. This means that when the focal level is "A2", the calculation includes values in group "A2" and values in the lower level "A1". When calculating the mean value of "A3", the calculation includes all values in the focal level "A3", and all lower levels, i.e. also values in group "A2" and "A1".

I aim to have an output like this:

  MeanValue  level
1      1.00     A1 # mean of level A1
2      0.71     A2 # mean of level A2 and A1
3      0.60     A3 # mean of level A3, A2, and A1

Could someone show me how to code this? I would really appreciate any help!

Upvotes: 3

Views: 246

Answers (3)

Henrik

Reputation: 67778

Here are two data.table alternatives. As in akrun's answer, a cumulative mean is calculated, but here base and data.table functions are used.

library(data.table)

setDT(df)[order(level), .(level = unique(level),
                          mean_value = (cumsum(value) / .I)[!duplicated(level, fromLast = TRUE)])]

setDT(df)[order(level), .(level = unique(level),
                          mean_value = (cumsum(value) / .I)[rev(rowid(rev(level))) == 1])]

#    level mean_value
# 1:    A1  1.0000000
# 2:    A2  0.7142857
# 3:    A3  0.6000000

Although speed is not an issue for the OP, it may be worth noting for future visitors that even on a modest data set (200 groups with 200 rows each), both of these alternatives are considerably faster than the tidyverse code.

Prepare data for benchmark, 200 groups with 200 rows each:

ng = 2e2
n = 2e2

set.seed(1)
df = data.frame(value = rbinom(ng * n, 1, 0.5),
                level = rep(seq(ng), each = n))

df3 = df
df4 = df

Benchmark:

library(microbenchmark)
library(dplyr)
library(purrr)

microbenchmark(
  f1 = {
    d1 = df %>%
      arrange(level) %>%
      mutate(mean_value = map_dbl(.x = seq_along(level), ~ mean(value[1:.x]))) %>%
      group_by(level) %>%
      summarise(mean_value = last(mean_value))
  },
  
  f2 = {
    d2 = df %>%
      arrange(level) %>%
      summarise(mean_value = cummean(value)[!duplicated(level, fromLast = TRUE)], 
                level = unique(level)) %>%
      select(level, mean_value)
  },
  
  f3 = {
    d3 = setDT(df3)[order(level), .(level = unique(level), mean_value = (cumsum(value) / .I)[!duplicated(level, fromLast = TRUE)])]
  },
  
  f4 = {
    d4 = setDT(df4)[order(level), data.table(level = level, mean_value = cumsum(value) / .I)[rev(rowid(rev(level))) == 1]]
  },
  times = 10L)

Result:

# Unit: milliseconds
# expr         min          lq        mean      median          uq         max neval
#   f1 2834.601301 2869.658601 3052.233971 3030.448801 3191.671301 3303.532801    10
#   f2  127.474801  130.049101  138.202461  135.293550  141.301601  173.230901    10
#   f3    2.929702    3.401802    3.706061    3.450152    3.838801    5.585200    10
#   f4    3.888700    3.984301    4.372161    4.166951    4.844901    5.198001    10

all.equal(as.data.frame(d1), d2)
all.equal(as.data.frame(d1), as.data.frame(d3))
all.equal(as.data.frame(d1), as.data.frame(d4))
# TRUE

Another approach would be a non-equi join. It is a bit slower than the cumulative mean trick in this particular case, but it is more general and can handle more complex summary functions.

setDT(df)[ , level := factor(level)]
df[.(level = sort(unique(df$level))), on = .(level <= level),
   .(mean_value = mean(value)), by = .EACHI]

Upvotes: 2

akrun

Reputation: 887301

An option is to make use of cummean from dplyr:

library(dplyr)
df %>%
   arrange(level) %>% 
   summarise(mean_value = cummean(value)[!duplicated(level, 
       fromLast = TRUE)], level = unique(level)) %>% 
   select(level, mean_value)
#  level mean_value
#1    A1  1.0000000
#2    A2  0.7142857
#3    A3  0.6000000

Upvotes: 1

tmfmnk

Reputation: 39858

One option using dplyr and purrr could be:

library(dplyr)
library(purrr)

df %>%
 arrange(level) %>%
 mutate(mean_value = map_dbl(.x = seq_along(level), ~ mean(value[1:.x]))) %>%
 group_by(level) %>%
 summarise(mean_value = last(mean_value))

  level mean_value
  <fct>      <dbl>
1 A1         1    
2 A2         0.714
3 A3         0.6  

Upvotes: 4
