Reputation: 12084
Say I have a data frame like this:
df <- data.frame(year_day = rep(1:365, 3),
year = rep(2001:2003, each = 365),
value = sin(2*pi*rep(1:365, 3)/365))
It represents some value (value
) for each day of the year (year_day
) between 2001 and 2003. I'd like to plot each year and use ggplot2
to do so.
ggplot(df) + geom_point(aes(year_day, value)) + facet_wrap(~year, ncol=1)
This gives me:
Great. Now, say I want to extend my plotting region a bit, so that each year includes 3 months of the preceding year and 3 months of the next year (if these data exist). This means that some data will be plotted twice. For example, the first three months of 2003 will appear in the plots for 2002 and 2003. So, I could duplicate these rows and assign them to 2002, but with year-day
s 366 to 485. This works, but is cludgy. Is there a more elegant solution?
Upvotes: 4
Views: 1480
Reputation: 8295
This is something I'd been thinking about for a while, so this was a good enough reason to try to implement it. It does still involve duplicating rows, which is kludgy, but it's the best way I could think of.
This is a tidy pipe-able function that takes a dataframe (even a grouped one) as its first argument, and a column of dates as its second. There's an optional third argument to extend how far each window expands (defaults to 0.25, or 3 months). The fourth argument would be for things like fiscal or academic years that aren't Jan-Jan, but I haven't thought deeply through that one yet.
The output is the same dataframe, with the duplicated rows for the tails of the years, with additional columns doy_wrapped for the day of year (going from negatives to >365), and nominal_yr, which is the year that each window is centered on.
Example, using the dataset ggplot2::economics
:
library(dplyr)
library(lubridate)
economics %>%
filter(year(date) > 2007)
# A tibble: 88 x 6 date pce pop psavert uempmed unemploy <date> <dbl> <int> <dbl> <dbl> <int> 1 2008-01-01 9963.2 303506 3.4 9.0 7685 2 2008-02-01 9955.7 303711 3.9 8.7 7497 3 2008-03-01 10004.2 303907 4.0 8.7 7822 4 2008-04-01 10044.6 304117 3.5 9.4 7637 5 2008-05-01 10093.3 304323 7.9 7.9 8395 6 2008-06-01 10149.4 304556 5.6 9.0 8575 7 2008-07-01 10151.1 304798 4.4 9.7 8937 8 2008-08-01 10140.3 305045 3.7 9.7 9438 9 2008-09-01 10083.2 305309 4.4 10.2 9494 10 2008-10-01 9983.3 305554 5.4 10.4 10074 # ... with 78 more rows
economics %>%
filter(year(date) > 2007) %>%
wrap_years(date, expand = 3/12)
# A tibble: 136 x 8 # Groups: nominal_yr [8] date pce pop psavert uempmed unemploy nominal_yr doy_wrapped <date> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> 1 2008-01-01 9963.2 303506 3.4 9.0 7685 2008 1 2 2008-02-01 9955.7 303711 3.9 8.7 7497 2008 32 3 2008-03-01 10004.2 303907 4.0 8.7 7822 2008 61 4 2008-04-01 10044.6 304117 3.5 9.4 7637 2008 92 5 2008-05-01 10093.3 304323 7.9 7.9 8395 2008 122 6 2008-06-01 10149.4 304556 5.6 9.0 8575 2008 153 7 2008-07-01 10151.1 304798 4.4 9.7 8937 2008 183 8 2008-08-01 10140.3 305045 3.7 9.7 9438 2008 214 9 2008-09-01 10083.2 305309 4.4 10.2 9494 2008 245 10 2008-10-01 9983.3 305554 5.4 10.4 10074 2009 -90 # ... with 126 more rows
This does knock it out of order somewhat; it triplicates rows in their order, then reassigns them to neighboring years. It preserves the original grouping while adding one for the new nominal_yr (to remove possibly orphaned tails, where the central year data is missing).
economics %>%
filter(year(date) > 2007) %>%
wrap_years(date, expand = 3/12) %>%
ggplot(aes(doy_wrapped, unemploy)) +
geom_line() + facet_wrap(~nominal_yr, ncol = 3)
And then a couple tricks to dress it up and correct the axis:
economics %>%
filter(year(date) > 2007) %>%
wrap_years(date, expand = 3/12) %>%
ggplot(aes(doy_wrapped + ymd("1900-01-01") - 1, unemploy)) +
geom_line() + facet_wrap(~nominal_yr, ncol = 2) +
geom_vline(xintercept = as.numeric(c(ymd("1900-01-01"), ymd("1901-01-01")))) +
scale_x_date(date_breaks = "2 months",date_labels = "%b",
name = NULL, expand = c(0,0) +
theme_minimal() +
theme(panel.spacing.x = unit(1, "cm"))
The + ymd("1900-01-01") - 1
in the aes(...)
is arbitrary, you just want it to line up with a January 1 so that each year has the right months. Then you match it to the xintercept =
in the vertical lines.
Ideally this would eventually be part of a family of wrap_*
functions, for quarters, months, hours, decades, etc.
Code for the function:
wrap_years <- function(df, datecol, expand = 0.25, offset = "2001-01-01") {
if(!is.data.frame(df)) {return(df)}
datecol <- enquo(datecol)
if(expand > 1) {
warning(paste0("Window expansions of > 1 are not supported."))
return(df)
}
if(!(quo_name(datecol) %in% names(df))) {
warning(paste0("Column '", quo_name(datecol), "' not found in data."))
return(df)
}
# offset <- as_date(offset)
# warning(paste0("Using ", stamp("August 26", orders = "md")(offset),
# " as start of year. Not yet implemented."))
if(!is.Date(df %>% pull(!!datecol))) {
warning(paste0("Use lubridate functions to parse '",
quo_name(datecol),
"' before proceeding."))
return(df)
}
df %>%
mutate(adj_wrap = list(-1:1)) %>%
tidyr::unnest() %>%
mutate(nominal_yr = year(!!datecol) - adj_wrap,
doy_wrapped = yday(!!datecol) + 365*adj_wrap) %>%
filter(between(doy_wrapped, -expand * 365, (1 + expand) * 365)) %>%
select(-adj_wrap) %>%
group_by(nominal_yr, add = T) %>%
filter(sum(year(!!datecol) != nominal_yr) != length(nominal_yr))
}
I had assumed that copying the least number of rows was going to be the fastest method, which was the paradigm behind my first stab at it. Thinking about it later, I realized a more naive approach would be to simply copy all the rows, which turns out to be much faster. Then the filtering step is done with between
, which is also fast. This version of the function is about 2x the speed of the previous version (but about 0.01x the speed of plotting the raw data).
Upvotes: 1