Reputation: 85
I'm having trouble articulating this question. I have a dataset of daily income and expenses covering several years. I have tried a few approaches, so there are now a lot of date columns.
> str(df)
'data.frame': 3047 obs. of 8 variables:
$ Date : Factor w/ 1219 levels "2014-05-06T00:00:00.0000000",..: 6 9 2 3 4 6 10 11 13 14 ...
$ YearMonthnumber : Factor w/ 44 levels "2014/05","2014/06",..: 1 1 1 1 1 1 1 1 1 1 ...
$ cat : Factor w/ 10 levels "Account Adjustment",..: 1 2 3 3 3 3 3 3 3 3 ...
$ Value : num 2.2 277.7 20 14.1 6.8 ...
$ Income_or_expense: Factor w/ 2 levels "Expense","Income": 1 1 1 1 1 1 1 1 1 1 ...
$ ddate : Date, format: "2014-05-16" "2014-05-19" "2014-05-12" "2014-05-13" ...
$ monthly : Date, format: "2014-05-01" "2014-05-01" "2014-05-01" "2014-05-01" ...
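Basically what I want to plot is:
1. The monthly sum of Value for income and for expense, as points.
2. A smoothed line fitted to those monthly sums.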
I can do step one, but not two. Here is what I have:
library(ggplot2)
library(scales)  # provides date_format()
ggplot(data = subset(df, cat != "Transfer"), aes(x = monthly, y = Value, colour = Income_or_expense)) +
  stat_summary(fun.y = sum, geom = "point") +
  scale_x_date(labels = date_format("%Y-%m"))
How can I add a smooth geom to these resulting summary stats?
Edit: If I add + stat_summary(fun.y = sum, geom = "smooth"), the result is a line graph, not a smoothed model. And if I add it without fun.y = sum, then the smoothed line is based on daily values, not the monthly aggregates.
Thanks.
Upvotes: 3
Views: 2177
Reputation: 93881
You could summarize the data by month first and then run geom_smooth on the summarized data. I've created some fake time series data for the example.
library(tidyverse)
library(lubridate)
# Fake data
set.seed(2)
# With one order of differencing, arima.sim(n = 364) returns 365 values,
# so each series lines up with the 365 days of 2015
dat = data.frame(value = c(arima.sim(list(order = c(1,1,0), ar = 0.7), n = 364),
                           arima.sim(list(order = c(1,1,0), ar = 0.7), n = 364)) + 100,
                 IE = rep(c("Income","Expense"), each=365),
                 date = rep(seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by="day"), 2))
Now we sum by month and plot. I've included points for the actual monthly sums to compare with the smoother line:
ggplot(dat %>% group_by(IE, month=month(date, label=TRUE)) %>%
         summarise(value=sum(value)),
       aes(month, value, colour=IE, group=IE)) +
  geom_smooth(se=FALSE, span=0.75) + # span=0.75 is the default
  geom_point() +
  expand_limits(y=0) +
  theme_classic()
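Applied to a data frame shaped like the one in the question, the same idea might look like the sketch below. The column names (ddate, monthly, Value, Income_or_expense, cat) are taken from the str() output in the question; the toy df and its values are made up so that the block runs on its own.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the question's data frame (values are made up)
set.seed(1)
dates <- seq(as.Date("2014-05-01"), as.Date("2015-04-30"), by = "day")
df <- data.frame(
  ddate = rep(dates, 2),
  Value = runif(2 * length(dates), 5, 50),
  Income_or_expense = rep(c("Income", "Expense"), each = length(dates)),
  cat = "Groceries"
)
df$monthly <- as.Date(format(df$ddate, "%Y-%m-01"))

# One row per month and type, holding the monthly sum
monthly_sums <- df %>%
  filter(cat != "Transfer") %>%
  group_by(Income_or_expense, monthly) %>%
  summarise(Value = sum(Value), .groups = "drop")

# Points are the monthly sums; the smoother is fitted to those same sums
p <- ggplot(monthly_sums, aes(monthly, Value, colour = Income_or_expense)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_x_date(date_labels = "%Y-%m")
p
```

Because monthly stays a Date column here, the x axis remains continuous and scale_x_date still applies.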
I'm not that familiar with time series analysis, but it seems like a better approach would be to calculate the monthly income and expense rate implied by each daily value (the daily value multiplied by the number of days in its month) and then run a smoother through those rates. That way you're not summarizing away the variation in the underlying data. In the plot below, I've included the individual points so you can compare them with the smoother line.
ggplot(dat %>% group_by(IE, month=month(date, label=TRUE)) %>%
         mutate(value = value * n()),
       aes(date, value, colour=IE)) +
  geom_smooth(se=FALSE, span=0.75) +
  geom_point(alpha=0.3, size=1) +
  expand_limits(y=0) +
  theme_classic()
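Note that value * n() gives a monthly rate only because the fake data has exactly one row per day in each IE group, so n() equals the number of days in the month. If the data can contain several transactions on the same day, as the 3047 rows over 1219 distinct dates in the question suggest, one possible variant is to sum per day first and then scale by lubridate's days_in_month(). This is a sketch under that assumption; the toy data frame tx is hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical transactions: two on 2015-01-01, one on 2015-02-01
tx <- data.frame(
  date  = as.Date(c("2015-01-01", "2015-01-01", "2015-02-01")),
  IE    = "Expense",
  value = c(10, 5, 20)
)

daily_rate <- tx %>%
  group_by(IE, date) %>%
  summarise(value = sum(value), .groups = "drop") %>%  # one row per day
  mutate(monthly_rate = value * as.integer(days_in_month(date)))

daily_rate
```

Days with no transactions are simply absent from daily_rate; whether to fill them in with zeros depends on what the plot is meant to show.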
You could also plot the 30-day rolling sum, which avoids grouping the data into arbitrary time periods. Once again, I've included points for the monthly income and expense rate represented by each daily value.
library(zoo)  # provides rollsum(); xts loads zoo as well
ggplot(dat %>% group_by(IE) %>%
         # fill=NA pads the ends where the 30-day window does not fit
         mutate(rolling_sum = rollsum(value, k=30, align="center", fill=NA),
                value = value * 30),
       aes(date, colour=IE)) +
  geom_line(aes(y=rolling_sum), size=1) +
  geom_point(aes(y=value), alpha=0.2, size=1) +
  expand_limits(y=0) +
  theme_classic()
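To see what the centred rolling sum computes, here is a minimal sketch on a short vector with a window of 3 instead of 30. It uses rollsum() from zoo (which xts loads); the input vector x is made up.

```r
library(zoo)

x <- c(1, 2, 3, 4, 5, 6)

# Each position gets the sum of itself and one neighbour on each side;
# the two ends, where the window does not fit, are filled with NA
rs <- rollsum(x, k = 3, fill = NA, align = "center")
rs
# NA  6  9 12 15 NA
```

With k=30 the same padding leaves the first and last couple of weeks of each series without a rolling sum, which is why the line in the plot stops short of the ends of the date range.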
Upvotes: 3