Ryan G
Ryan G

Reputation: 7

Accounting for "missing" values, dealing with instances when 0's should be entered

Okay, so I know this looks like a million other problems of the same type. But I dont think I've seen one with this exact issue so here I am haha.

Lets say you have a dataframe of sales by customer, sku and date. All is good in the world. But what if you had this df, it was huge (hundreds of customers, and if we wanted to punish ourselves further, hundreds of skus each, every week). Then, after grouping by customer, sku, and week, then summarizing, we realize that not all skus are bought by all customers, every week. This complicates things.

For example, without a line entry for each week by customer with a revenue value of 0 for every sku that wasnt bought, plotting with geom_line and geom_smooth becomes inaccurate. the lines continue on their trajectory, instead of dropping to zero.

I might just be very stupid right now, but how do I introduce all these entries with 0 revenue every time there is a week where a sku wasnt bought by any given customer so that I get the appropriate graphs I would need? Or is there an entirely different way of dealing with this sort of thing?

# Extremely simplified
# Plot from DF WO zeros

df_plot<- data.frame(x=c(1,2,3,4,5,10), y=c(1,2,3,4,5,10))
ggplot(df_plot,aes(x,y))+
  geom_point()+
  geom_smooth()+
  geom_line()

# Theoretical plot where we have zeros where no sales were made

df_plot2<- data.frame(x=c(1,2,3,4,5,6,7,8,9,10), y=c(1,2,3,4,5,0,0,0,0,10))
ggplot(df_plot2,aes(x,y))+
  geom_point()+
  geom_smooth()+
  geom_line()

I dont know if theres just an easy way for dealing with this in the graph? Otherwise I need to add 0's in the revenue column for every week a sku wasnt bought by the customer. If that's the case, is there an efficient way of doing that? Thanks as always!

Upvotes: 0

Views: 62

Answers (1)

Giovanni Colitti
Giovanni Colitti

Reputation: 2344

You can fill missing values for levels of a factor variable or continuous variable that are not present in the data set with tidyr::complete and tidyr::full_seq.

In your case, I think you want:

# Load required packages.
library(ggplot2)
library(tidyr)

# Create data frame with 0 values omitted.
df_plot<- data.frame(x=c(1,2,3,4,5,10), y=c(1,2,3,4,5,10))
ggplot(df_plot,aes(x,y))+
  geom_point()+
  geom_smooth()+
  geom_line()

# Fill gap in continuous variable with 0's and plot. 
df_plot %>% 
  complete(x = full_seq(x, 1), fill = list(y = 0)) %>% 
  ggplot(., aes(x, y)) +
  geom_point() +
  geom_smooth() +
  geom_line()

Alternatively, you can do this with the statar package's fill_gap function:

df_plot %>% 
statar::fill_gap(x, full = TRUE) %>% 
  replace_na(list(y = 0)) %>% 
  ggplot(., aes(x, y)) +
  geom_point() +
  geom_smooth() +
  geom_line()

Upvotes: 1

Related Questions