TheGoat
TheGoat

Reputation: 2867

How to handle null subsets in R (dplyr)

I wish to calculate the average volume per group by sub-setting my data frame between two date ranges.

I have noticed that not all products have records between these two dates and need some way to handle these as right now they are captured as blanks, ideally these would be filled with 0 rather than blanks.

Here is a reprex:

product <- c("A","A","A","A","B","B","B","B","C")
date1 <- c("2020-06-20","2020-06-26","2020-08-20","2020-08-22","2020-06-26","2020-06-27","2020-08-21","2020-08-22","2019-06-20")
winA_start <- rep("2020-06-24", times = 9)
winA_end <- rep("2020-06-30", times = 9)
winB_start <- rep("2020-08-19", times = 9)
winB_end <- rep("2020-08-26", times = 9)
vol <- c(200,201,600,800,800,1000,50,100,208)
x <- data.frame(product,date1,winA_start,winA_end,winB_start,winB_end,vol)
x[,2:6] <- lapply(x[,2:6], as.Date)

I wish to run the following code to calculate the average value for both Window A and Window B

y <- x %>% group_by(product) %>% mutate(WIN_A_AVG = round(mean(vol[date1 >= winA_start & date1 <= winA_end]), digits = 0), WIN_B_AVG = round(mean(vol[date1 >= winB_start & date1 <= winB_end]), digit = 0))

This works as desired except for Product C where date1 does not meet the criteria and it therefore returns NaN.

Rather than return NaN is there a way I could catch these type of errors and just return 0 in an elegant way?

enter image description here

Upvotes: 0

Views: 98

Answers (1)

DaveArmstrong
DaveArmstrong

Reputation: 21757

If all you want to do is change the missing data to zero, you could do that in the same mutate statement.

y <- x %>% 
  group_by(product) %>% 
  mutate(WIN_A_AVG = round(mean(vol[date1 >= winA_start & date1 <= winA_end]), digits = 0), 
         WIN_B_AVG = round(mean(vol[date1 >= winB_start & date1 <= winB_end]), digit = 0), 
         WIN_A_AVG = case_when(is.na(WIN_A_AVG) ~ 0, 
                               TRUE ~ WIN_A_AVG), 
         WIN_B_AVG = case_when(is.na(WIN_B_AVG) ~ 0, 
                               TRUE ~ WIN_B_AVG))

Here, the case_when() function just indicates that when the result is missing, replace it with zero, otherwise keep it the same. However, if you wanted to only catch those cases where there were no observations, then you'd need something like the code below. This finds the number of observations first, then replaces only those that have zero observations with 0. This would work better if there is a chance that missing data are in vol such that you would want the mean to return NA sometimes as a check.

y <- x %>% 
  group_by(product) %>% 
  mutate(WIN_A_N = length(vol[date1 >= winA_start & date1 <= winA_end]), 
         WIN_B_N = length(vol[date1 >= winA_start & date1 <= winA_end]), 
         WIN_A_AVG = round(mean(vol[date1 >= winA_start & date1 <= winA_end]), digits = 0), 
         WIN_B_AVG = round(mean(vol[date1 >= winB_start & date1 <= winB_end]), digit = 0), 
         WIN_A_AVG = case_when(WIN_A_N == 0 ~ 0, 
                               TRUE ~ WIN_A_AVG), 
         WIN_B_AVG = case_when(WIN_B_N == 0 ~ 0, 
                               TRUE ~ WIN_B_AVG)) %>%
  select(-WIN_A_N, -WIN_B_N)
y
# A tibble: 9 x 9
# Groups:   product [3]
#  product date1      winA_start winA_end   winB_start winB_end     vol WIN_A_AVG WIN_B_AVG
#  <chr>   <date>     <date>     <date>     <date>     <date>     <dbl>     <dbl>     <dbl>
#1 A       2020-06-20 2020-06-24 2020-06-30 2020-08-19 2020-08-26   200       201       700
#2 A       2020-06-26 2020-06-24 2020-06-30 2020-08-19 2020-08-26   201       201       700
#3 A       2020-08-20 2020-06-24 2020-06-30 2020-08-19 2020-08-26   600       201       700
#4 A       2020-08-22 2020-06-24 2020-06-30 2020-08-19 2020-08-26   800       201       700
#5 B       2020-06-26 2020-06-24 2020-06-30 2020-08-19 2020-08-26   800       900        75
#6 B       2020-06-27 2020-06-24 2020-06-30 2020-08-19 2020-08-26  1000       900        75
#7 B       2020-08-21 2020-06-24 2020-06-30 2020-08-19 2020-08-26    50       900        75
#8 B       2020-08-22 2020-06-24 2020-06-30 2020-08-19 2020-08-26   100       900        75
#9 C       2019-06-20 2020-06-24 2020-06-30 2020-08-19 2020-08-26   208         0         0

Upvotes: 2

Related Questions