Reputation: 674
I am trying to apply a filter within a mutate, but I haven't figured out the right way to apply a filter while keeping the data frame grouping in tact.
Here is a simple reproducible example:
# Sample data
my_dates = seq(as.Date("2020/1/1"), by = "month", length.out = 6)
grp = c(rep("A",3), rep("B", 3))
x = c(2,4,6,8,10,12)
my_df <- data.frame(my_dates, grp, x)
my_dates grp x
1 2020-01-01 A 2
2 2020-02-01 A 4
3 2020-03-01 A 6
4 2020-04-01 B 8
5 2020-05-01 B 10
6 2020-06-01 B 12
# Pick a max date for which the data will be filtered
max_date <- "2020-05-01"
# Try to get the average by group, after filtering out the max date included
filt_data <- my_df %>%
group_by(grp) %>%
mutate(included_data = my_dates < max_date,
my_mean = mean(filter(., my_dates < max_date)$x)
)
# A tibble: 6 x 5
# Groups: grp [2]
my_dates grp x included_data my_mean
<date> <fct> <dbl> <lgl> <dbl>
1 2020-01-01 A 2 TRUE 5
2 2020-02-01 A 4 TRUE 5
3 2020-03-01 A 6 TRUE 5
4 2020-04-01 B 8 TRUE 5
5 2020-05-01 B 10 FALSE 5
6 2020-06-01 B 12 FALSE 5
The output I was hoping to get was this, where the mean of included data for Group A = mean (2,4,6) = 4 and the mean of the included data for Group B = mean(8) = 8:
my_dates grp x included_data my_mean
<date> <fct> <dbl> <lgl> <dbl>
1 2020-01-01 A 2 TRUE 4
2 2020-02-01 A 4 TRUE 4
3 2020-03-01 A 6 TRUE 4
4 2020-04-01 B 8 TRUE 8
5 2020-05-01 B 10 FALSE 8
6 2020-06-01 B 12 FALSE 8
I'm not sure what the correct mutate and filter would be so help is appreciated, as is an explanation for why the above does not work as intended.
Thanks!
Upvotes: 2
Views: 360
Reputation: 887951
Here, it may be better to use the index from 'included_data' to subset the 'x' column instead of doing another filter
library(dplyr)
my_df %>%
group_by(grp) %>%
mutate(included_data = my_dates < max_date,
my_mean = mean(x[included_data])) %>%
ungroup
-output
# A tibble: 6 x 5
# my_dates grp x included_data my_mean
# <date> <chr> <dbl> <lgl> <dbl>
#1 2020-01-01 A 2 TRUE 4
#2 2020-02-01 A 4 TRUE 4
#3 2020-03-01 A 6 TRUE 4
#4 2020-04-01 B 8 TRUE 8
#5 2020-05-01 B 10 FALSE 8
#6 2020-06-01 B 12 FALSE 8
Regarding why the OP's code didn't work, the .
is the full dataset and it is doing the subset from that full data instead of the grouped data. We could use cur_data()
instead of .
my_df %>%
group_by(grp) %>%
mutate(included_data = my_dates < max_date,
my_mean = mean(filter(cur_data(), my_dates < max_date)$x)) %>%
ungroup
# A tibble: 6 x 5
# my_dates grp x included_data my_mean
# <date> <chr> <dbl> <lgl> <dbl>
#1 2020-01-01 A 2 TRUE 4
#2 2020-02-01 A 4 TRUE 4
#3 2020-03-01 A 6 TRUE 4
#4 2020-04-01 B 8 TRUE 8
#5 2020-05-01 B 10 FALSE 8
#6 2020-06-01 B 12 FALSE 8
Upvotes: 2