Reputation: 2867
I wish to calculate the average volume per group by sub-setting my data frame between two date ranges.
I have noticed that not all products have records between these two dates and need some way to handle these as right now they are captured as blanks, ideally these would be filled with 0 rather than blanks.
Here is a reprex:
product <- c("A","A","A","A","B","B","B","B","C")
date1 <- c("2020-06-20","2020-06-26","2020-08-20","2020-08-22","2020-06-26","2020-06-27","2020-08-21","2020-08-22","2019-06-20")
winA_start <- rep("2020-06-24", times = 9)
winA_end <- rep("2020-06-30", times = 9)
winB_start <- rep("2020-08-19", times = 9)
winB_end <- rep("2020-08-26", times = 9)
vol <- c(200,201,600,800,800,1000,50,100,208)
x <- data.frame(product,date1,winA_start,winA_end,winB_start,winB_end,vol)
x[,2:6] <- lapply(x[,2:6], as.Date)
I wish to run the following code to calculate the average value for both Window A and Window B
y <- x %>% group_by(product) %>% mutate(WIN_A_AVG = round(mean(vol[date1 >= winA_start & date1 <= winA_end]), digits = 0), WIN_B_AVG = round(mean(vol[date1 >= winB_start & date1 <= winB_end]), digit = 0))
This works as desired except for Product C where date1 does not meet the criteria and it therefore returns NaN.
Rather than return NaN is there a way I could catch these type of errors and just return 0 in an elegant way?
Upvotes: 0
Views: 98
Reputation: 21757
If all you want to do is change the missing data to zero, you could do that in the same mutate statement.
y <- x %>%
group_by(product) %>%
mutate(WIN_A_AVG = round(mean(vol[date1 >= winA_start & date1 <= winA_end]), digits = 0),
WIN_B_AVG = round(mean(vol[date1 >= winB_start & date1 <= winB_end]), digit = 0),
WIN_A_AVG = case_when(is.na(WIN_A_AVG) ~ 0,
TRUE ~ WIN_A_AVG),
WIN_B_AVG = case_when(is.na(WIN_B_AVG) ~ 0,
TRUE ~ WIN_B_AVG))
Here, the case_when()
function just indicates that when the result is missing, replace it with zero, otherwise keep it the same. However, if you wanted to only catch those cases where there were no observations, then you'd need something like the code below. This finds the number of observations first, then replaces only those that have zero observations with 0. This would work better if there is a chance that missing data are in vol
such that you would want the mean to return NA
sometimes as a check.
y <- x %>%
group_by(product) %>%
mutate(WIN_A_N = length(vol[date1 >= winA_start & date1 <= winA_end]),
WIN_B_N = length(vol[date1 >= winA_start & date1 <= winA_end]),
WIN_A_AVG = round(mean(vol[date1 >= winA_start & date1 <= winA_end]), digits = 0),
WIN_B_AVG = round(mean(vol[date1 >= winB_start & date1 <= winB_end]), digit = 0),
WIN_A_AVG = case_when(WIN_A_N == 0 ~ 0,
TRUE ~ WIN_A_AVG),
WIN_B_AVG = case_when(WIN_B_N == 0 ~ 0,
TRUE ~ WIN_B_AVG)) %>%
select(-WIN_A_N, -WIN_B_N)
y
# A tibble: 9 x 9
# Groups: product [3]
# product date1 winA_start winA_end winB_start winB_end vol WIN_A_AVG WIN_B_AVG
# <chr> <date> <date> <date> <date> <date> <dbl> <dbl> <dbl>
#1 A 2020-06-20 2020-06-24 2020-06-30 2020-08-19 2020-08-26 200 201 700
#2 A 2020-06-26 2020-06-24 2020-06-30 2020-08-19 2020-08-26 201 201 700
#3 A 2020-08-20 2020-06-24 2020-06-30 2020-08-19 2020-08-26 600 201 700
#4 A 2020-08-22 2020-06-24 2020-06-30 2020-08-19 2020-08-26 800 201 700
#5 B 2020-06-26 2020-06-24 2020-06-30 2020-08-19 2020-08-26 800 900 75
#6 B 2020-06-27 2020-06-24 2020-06-30 2020-08-19 2020-08-26 1000 900 75
#7 B 2020-08-21 2020-06-24 2020-06-30 2020-08-19 2020-08-26 50 900 75
#8 B 2020-08-22 2020-06-24 2020-06-30 2020-08-19 2020-08-26 100 900 75
#9 C 2019-06-20 2020-06-24 2020-06-30 2020-08-19 2020-08-26 208 0 0
Upvotes: 2