Anienumaked

Reputation: 85

How to plot smoothed summary stats in ggplot2

I'm having trouble articulating this question. I have a dataset of daily income and expenses spanning several years. I have tried a few approaches, so it now contains several date columns.

> str(df)
'data.frame':   3047 obs. of  8 variables:
 $ Date             : Factor w/ 1219 levels "2014-05-06T00:00:00.0000000",..: 6 9 2 3 4 6 10 11 13 14 ...
 $ YearMonthnumber  : Factor w/ 44 levels "2014/05","2014/06",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ cat              : Factor w/ 10 levels "Account Adjustment",..: 1 2 3 3 3 3 3 3 3 3 ...
 $ Value            : num  2.2 277.7 20 14.1 6.8 ...
 $ Income_or_expense: Factor w/ 2 levels "Expense","Income": 1 1 1 1 1 1 1 1 1 1 ...
 $ ddate            : Date, format: "2014-05-16" "2014-05-19" "2014-05-12" "2014-05-13" ...
 $ monthly          : Date, format: "2014-05-01" "2014-05-01" "2014-05-01" "2014-05-01" ...

Basically what I want to plot is:

I can do step one, but not two. Here is what I have:

ggplot(data = subset(df, cat!="Transfer"), aes(x = monthly, y= Value, colour = Income_or_expense)) +
  stat_summary(fun.y = sum, geom = "point") +
  scale_x_date(labels = date_format("%Y-%m"))

How can I add a smooth geom to these resulting summary stats?

Edit: If I add + stat_summary(fun.y = sum, geom = "smooth"), the result is a line graph connecting the monthly sums, not a smoothed model. And if I add it without fun.y = sum, the smoothed line is based on the daily values, not the monthly aggregates.

Thanks.

Upvotes: 3

Views: 2177

Answers (1)

eipi10

Reputation: 93881

You could summarize the data by month first and then run geom_smooth on the summarized data. I've created some fake time series data for the example.

library(tidyverse)  
library(lubridate)

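# Fake data
# Note: with one order of differencing (d = 1), arima.sim() returns n + 1 values,
# so n = 364 yields 365 values per series, one for each day of 2015
set.seed(2)
dat = data.frame(value = c(arima.sim(list(order = c(1,1,0), ar = 0.7), n = 364),
                           arima.sim(list(order = c(1,1,0), ar = 0.7), n = 364)) + 100,
                 IE = rep(c("Income","Expense"), each=365),
                 date = rep(seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by="day"), 2))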

Now we sum by month and plot. I've included points for the actual monthly sums to compare with the smoother line:

ggplot(dat %>% group_by(IE, month=month(date, label=TRUE)) %>% 
         summarise(value=sum(value)), 
       aes(month, value, colour=IE, group=IE)) +
  geom_smooth(se=FALSE, span=0.75) +  # span=0.75 is the default
  geom_point() +
  expand_limits(y=0) +
  theme_classic()
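
For data covering several years, as in the question, month(date, label=TRUE) would pool the same calendar month from different years into one group. A minimal sketch of one way around that, assuming lubridate's floor_date() is used to build a month-start Date column (the `daily` data frame and its column names are made up for illustration):

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Hypothetical two-year daily data (2016 is a leap year, so 731 days per series)
set.seed(2)
daily <- data.frame(
  date  = rep(seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by = "day"), 2),
  IE    = rep(c("Income", "Expense"), each = 731),
  value = runif(1462, 50, 150)
)

# Sum by series and by the first day of each month, keeping the year
monthly_sums <- daily %>%
  group_by(IE, month = floor_date(date, "month")) %>%
  summarise(value = sum(value), .groups = "drop")

# Smooth the monthly sums; the points show the sums themselves
p <- ggplot(monthly_sums, aes(month, value, colour = IE)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, span = 0.75) +
  expand_limits(y = 0) +
  theme_classic()
p
```

Because `month` is a real Date here, the x-axis is continuous and scale_x_date() can be used with it.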


I'm not that familiar with time series analysis, but a better approach might be to convert each daily value into the monthly income or expense rate it represents (the daily value multiplied by the number of days in its month) and then run a smoother through those values. That way you're not summarizing away the variation in the underlying data. In the plot below, I've included the individual points so you can compare them with the smoother line.

ggplot(dat %>% group_by(IE, month=month(date, label=TRUE)) %>% 
         mutate(value = value * n()), 
       aes(date, value, colour=IE)) +
  geom_smooth(se=FALSE, span=0.75) +
  geom_point(alpha=0.3, size=1) +
  expand_limits(y=0) +
  theme_classic()
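
The multiplication by n() above treats the number of rows in each month group as the number of days in that month, which holds only when there is exactly one row per day per series, as in the fake data. A small sketch of the same arithmetic using lubridate's days_in_month() instead, which does not depend on row counts (the `toy` data frame is hypothetical):

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: the same daily value in a 31-day and a 28-day month
toy <- data.frame(
  date  = as.Date(c("2015-01-15", "2015-02-15")),
  value = c(100, 100)
)

# Scale each daily value to the monthly total it would imply
toy_rate <- toy %>%
  mutate(monthly_rate = value * unname(days_in_month(date)))

toy_rate$monthly_rate
# 3100 2800
```

With several transactions per day, as in the question's data, it would probably make sense to sum to one value per day before scaling.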


You could also plot the 30-day rolling sum, which avoids grouping the data into arbitrary time periods. Once again, I've included points for the monthly income and expense rate represented by each daily value.

library(xts)  # attaches zoo, which provides rollsum()

ggplot(dat %>% group_by(IE) %>% 
         mutate(rolling_sum = rollsum(value, k=30, align="center", fill=NA),  # fill=NA replaces the deprecated na.pad=TRUE
                value = value * 30), 
       aes(date, colour=IE)) +
  geom_line(aes(y=rolling_sum), size=1) +
  geom_point(aes(y=value), alpha=0.2, size=1) +
  expand_limits(y=0) +
  theme_classic()
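
To see what the centered window and NA padding do, here is a minimal sketch on a short vector, using zoo::rollsum() with fill = NA (the window width of 3 and the input values are chosen only for illustration):

```r
library(zoo)

x <- c(1, 2, 3, 4, 5, 6)

# Each output is the sum of a value and its two neighbours;
# the ends have no complete window, so they are filled with NA
rs <- rollsum(x, k = 3, fill = NA, align = "center")

rs
# NA  6  9 12 15 NA
```

With k = 30 the same thing happens on a larger scale: roughly half a window at each end of each series has no rolling sum, so the line starts and ends a couple of weeks inside the date range.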


Upvotes: 3
