Reputation: 45
I have data with some values ('value') and a grouping variable ('level'):
# Sample data frame
df <- data.frame(value = c(1,0,1,0,1,1,1,1,0,0),
level = c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2"))
value level
1 1 A2
2 0 A3
3 1 A1
4 0 A2
5 1 A1
6 1 A3
7 1 A2
8 1 A1
9 0 A3
10 0 A2
Conceptually, I assume that "A1" is the lowest level and "A3" the highest. I want to calculate mean value by including more and more groups in the calculation, like an 'expanding window', but based on entire groups, not on individual rows. Thus, for each group, the calculation will include values of the focal group and values in all other lower level groups.
I want the calculation for levels "A2" and "A3" to be inclusive. This means that when the focal level is "A2", the calculation includes the values in group "A2" and the values in the lower level "A1". When calculating the mean value of "A3", the calculation includes all values in the focal level "A3" and in all lower levels, i.e. also the values in groups "A2" and "A1".
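In other words, for each level the mean is taken over all rows at that level or below. A minimal base R sketch of that definition (just spelling out what I mean, not a proposed solution):

```r
# Recreate the sample data
df <- data.frame(value = c(1,0,1,0,1,1,1,1,0,0),
                 level = c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2"))

# For each level, average all values at that level or below
# (character comparison works here because "A1" < "A2" < "A3" sorts lexically)
levs <- sort(unique(df$level))
mean_values <- sapply(levs, function(l) mean(df$value[df$level <= l]))
data.frame(MeanValue = round(mean_values, 2), level = levs)
```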
I aim to have an output like this:
MeanValue level
1 1.00 A1 # mean of level A1
2 0.71 A2 # mean of level A2 and A1
3 0.60 A3 # mean of level A3, A2, and A1
Could someone show me how to code this? I would really appreciate any help!
Upvotes: 3
Views: 246
Reputation: 67778
Here are two data.table alternatives. As above, a cumulative mean is calculated, but here base and data.table functions are used.
setDT(df)[order(level), .(level = unique(level),
mean_value = (cumsum(value) / .I)[!duplicated(level, fromLast = TRUE)])]
setDT(df)[order(level), .(level = unique(level),
mean_value = (cumsum(value) / .I)[rev(rowid(rev(level))) == 1])]
# level mean_value
# 1: A1 1.0000000
# 2: A2 0.7142857
# 3: A3 0.6000000
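Both alternatives exploit the fact that a running mean is simply the cumulative sum divided by the row index (`.I` in data.table). A quick base R illustration on a small hypothetical vector `x`:

```r
x <- c(1, 0, 1, 0, 1)
# cumsum(x) / seq_along(x) gives the running mean, equivalent to dplyr::cummean(x)
running_mean <- cumsum(x) / seq_along(x)
running_mean
# 1.0000000 0.5000000 0.6666667 0.5000000 0.6000000
```

Selecting the last row of each sorted group from this running mean then yields the group-inclusive means.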
Although speed is not an issue for OP, it could be worth noting for future visitors that even on a modest data set, 200 groups with 200 rows each, these alternatives are both considerably faster than the tidyverse code.
Prepare data for benchmark, 200 groups with 200 rows each:
library(data.table)
library(dplyr)
library(purrr)
library(microbenchmark)

ng = 2e2
n = 2e2
set.seed(1)
df = data.frame(value = rbinom(ng * n, 1, 0.5),
                level = rep(seq(ng), each = n))
df3 = df
df4 = df
Benchmark:
microbenchmark(
f1 = {
d1 = df %>%
arrange(level) %>%
mutate(mean_value = map_dbl(.x = seq_along(level), ~ mean(value[1:.x]))) %>%
group_by(level) %>%
summarise(mean_value = last(mean_value))
},
f2 = {
d2 = df %>%
arrange(level) %>%
summarise(mean_value = cummean(value)[!duplicated(level, fromLast = TRUE)],
level = unique(level)) %>%
select(level, mean_value)
},
f3 = {
d3 = setDT(df3)[order(level), .(level = unique(level), mean_value = (cumsum(value) / .I)[!duplicated(level, fromLast = TRUE)])]
},
f4 = {
d4 = setDT(df4)[order(level), data.table(level = level, mean_value = cumsum(value) / .I)[rev(rowid(rev(level))) == 1]]
},
times = 10L)
Result:
# Unit: milliseconds
# expr min lq mean median uq max neval
# f1 2834.601301 2869.658601 3052.233971 3030.448801 3191.671301 3303.532801 10
# f2 127.474801 130.049101 138.202461 135.293550 141.301601 173.230901 10
# f3 2.929702 3.401802 3.706061 3.450152 3.838801 5.585200 10
# f4 3.888700 3.984301 4.372161 4.166951 4.844901 5.198001 10
all.equal(as.data.frame(d1), d2)
all.equal(as.data.frame(d1), as.data.frame(d3))
all.equal(as.data.frame(d1), as.data.frame(d4))
# All TRUE
Another approach would be a non-equi join. It is a bit slower than the cumulative mean trick in this particular case, but it is more general and can handle more complex functions.
setDT(df)[ , level := factor(level)]
df[.(level = sort(unique(df$level))), on = .(level <= level),
.(mean_value = mean(value)), by = .EACHI]
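Because the non-equi join materializes every row at or below each focal level, any summary statistic can be plugged in, not only those with a cumulative counterpart. A sketch using the question's sample data (the `median_value` column is just an illustrative extra):

```r
library(data.table)
df <- data.table(value = c(1,0,1,0,1,1,1,1,0,0),
                 level = factor(c("A2","A3","A1","A2","A1","A3","A2","A1","A3","A2")))

# Join each focal level against all rows with level <= focal level,
# then summarise the matched rows per focal level (by = .EACHI)
res <- df[.(level = sort(unique(df$level))), on = .(level <= level),
          .(mean_value = mean(value), median_value = median(value)), by = .EACHI]
res
```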
Upvotes: 2
Reputation: 887301
An option is to make use of cummean:
library(dplyr)
df %>%
  arrange(level) %>%
  summarise(mean_value = cummean(value)[!duplicated(level, fromLast = TRUE)],
            level = unique(level)) %>%
  select(level, mean_value)
# level mean_value
#1 A1 1.0000000
#2 A2 0.7142857
#3 A3 0.6000000
Upvotes: 1
Reputation: 39858
One option using dplyr and purrr could be:
library(dplyr)
library(purrr)

df %>%
  arrange(level) %>%
  mutate(mean_value = map_dbl(.x = seq_along(level), ~ mean(value[1:.x]))) %>%
  group_by(level) %>%
  summarise(mean_value = last(mean_value))
level mean_value
<fct> <dbl>
1 A1 1
2 A2 0.714
3 A3 0.6
Upvotes: 4