Dan
Dan

Reputation: 12084

ggplot2, facet_wrap: plotting data twice in different facets

Say I have a data frame like this:

df <- data.frame(year_day = rep(1:365, 3), 
                 year = rep(2001:2003, each = 365), 
                 value = sin(2*pi*rep(1:365, 3)/365))

It represents some value (value) for each day of the year (year_day) between 2001 and 2003. I'd like to plot each year and use ggplot2 to do so.

ggplot(df) + geom_point(aes(year_day, value)) + facet_wrap(~year, ncol=1)

This gives me:

enter image description here

Great. Now, say I want to extend my plotting region a bit, so that each year includes 3 months of the preceding year and 3 months of the next year (if these data exist). This means that some data will be plotted twice. For example, the first three months of 2003 will appear in the plots for 2002 and 2003. So, I could duplicate these rows and assign them to 2002, but with year-days 366 to 485. This works, but is cludgy. Is there a more elegant solution?

Upvotes: 4

Views: 1480

Answers (1)

Brian
Brian

Reputation: 8295

Edited to remove old version and replace

This is something I'd been thinking about for a while, so this was a good enough reason to try to implement it. It does still involve duplicating rows, which is kludgy, but it's the best way I could think of.

This is a tidy pipe-able function that takes a dataframe (even a grouped one) as its first argument, and a column of dates as its second. There's an optional third argument to extend how far each window expands (defaults to 0.25, or 3 months). The fourth argument would be for things like fiscal or academic years that aren't Jan-Jan, but I haven't thought deeply through that one yet.

The output is the same dataframe, with the duplicated rows for the tails of the years, with additional columns doy_wrapped for the day of year (going from negatives to >365), and nominal_yr, which is the year that each window is centered on.

Example, using the dataset ggplot2::economics:

library(dplyr)
library(lubridate)

economics %>% 
  filter(year(date) > 2007) 
# A tibble: 88 x 6
         date     pce    pop psavert uempmed unemploy
       <date>   <dbl>  <int>   <dbl>   <dbl>    <int>
 1 2008-01-01  9963.2 303506     3.4     9.0     7685
 2 2008-02-01  9955.7 303711     3.9     8.7     7497
 3 2008-03-01 10004.2 303907     4.0     8.7     7822
 4 2008-04-01 10044.6 304117     3.5     9.4     7637
 5 2008-05-01 10093.3 304323     7.9     7.9     8395
 6 2008-06-01 10149.4 304556     5.6     9.0     8575
 7 2008-07-01 10151.1 304798     4.4     9.7     8937
 8 2008-08-01 10140.3 305045     3.7     9.7     9438
 9 2008-09-01 10083.2 305309     4.4    10.2     9494
10 2008-10-01  9983.3 305554     5.4    10.4    10074
# ... with 78 more rows

economics %>% 
  filter(year(date) > 2007) %>% 
  wrap_years(date, expand = 3/12)
# A tibble: 136 x 8
# Groups:   nominal_yr [8]
         date     pce    pop psavert uempmed unemploy nominal_yr doy_wrapped
       <date>   <dbl>  <int>   <dbl>   <dbl>    <int>      <dbl>       <dbl>
 1 2008-01-01  9963.2 303506     3.4     9.0     7685       2008           1
 2 2008-02-01  9955.7 303711     3.9     8.7     7497       2008          32
 3 2008-03-01 10004.2 303907     4.0     8.7     7822       2008          61
 4 2008-04-01 10044.6 304117     3.5     9.4     7637       2008          92
 5 2008-05-01 10093.3 304323     7.9     7.9     8395       2008         122
 6 2008-06-01 10149.4 304556     5.6     9.0     8575       2008         153
 7 2008-07-01 10151.1 304798     4.4     9.7     8937       2008         183
 8 2008-08-01 10140.3 305045     3.7     9.7     9438       2008         214
 9 2008-09-01 10083.2 305309     4.4    10.2     9494       2008         245
10 2008-10-01  9983.3 305554     5.4    10.4    10074       2009         -90
# ... with 126 more rows

This does knock it out of order somewhat; it triplicates rows in their order, then reassigns them to neighboring years. It preserves the original grouping while adding one for the new nominal_yr (to remove possibly orphaned tails, where the central year data is missing).

economics %>% 
  filter(year(date) > 2007) %>% 
  wrap_years(date, expand = 3/12) %>%
  ggplot(aes(doy_wrapped, unemploy)) + 
  geom_line() + facet_wrap(~nominal_yr, ncol = 3)

enter image description here

And then a couple tricks to dress it up and correct the axis:

economics %>% 
  filter(year(date) > 2007) %>% 
  wrap_years(date, expand = 3/12) %>%
  ggplot(aes(doy_wrapped + ymd("1900-01-01") - 1, unemploy)) + 
  geom_line() + facet_wrap(~nominal_yr, ncol = 2) +
  geom_vline(xintercept = as.numeric(c(ymd("1900-01-01"), ymd("1901-01-01")))) +
  scale_x_date(date_breaks = "2 months",date_labels = "%b",
               name = NULL, expand = c(0,0) +
  theme_minimal() +
  theme(panel.spacing.x = unit(1, "cm"))

The + ymd("1900-01-01") - 1 in the aes(...) is arbitrary, you just want it to line up with a January 1 so that each year has the right months. Then you match it to the xintercept = in the vertical lines.

Ideally this would eventually be part of a family of wrap_* functions, for quarters, months, hours, decades, etc.

enter image description here

Code for the function:

wrap_years <- function(df, datecol, expand = 0.25, offset = "2001-01-01") {

  if(!is.data.frame(df)) {return(df)}

  datecol <- enquo(datecol)

  if(expand > 1) {
    warning(paste0("Window expansions of > 1 are not supported."))
    return(df)
  }


  if(!(quo_name(datecol) %in% names(df))) {
    warning(paste0("Column '", quo_name(datecol), "' not found in data."))
    return(df)
  }

  # offset <- as_date(offset)
  # warning(paste0("Using  ", stamp("August 26", orders = "md")(offset), 
  #                " as start of year. Not yet implemented."))

  if(!is.Date(df %>% pull(!!datecol))) {
    warning(paste0("Use lubridate functions to parse '", 
                   quo_name(datecol), 
                   "' before proceeding."))
    return(df)
  }

  df %>% 
    mutate(adj_wrap = list(-1:1)) %>% 
    tidyr::unnest() %>% 
    mutate(nominal_yr =  year(!!datecol) -     adj_wrap,
           doy_wrapped = yday(!!datecol) + 365*adj_wrap) %>% 
    filter(between(doy_wrapped, -expand * 365, (1 + expand) * 365)) %>% 
    select(-adj_wrap) %>% 
    group_by(nominal_yr, add = T) %>% 
    filter(sum(year(!!datecol) != nominal_yr) != length(nominal_yr))

}

I had assumed that copying the least number of rows was going to be the fastest method, which was the paradigm behind my first stab at it. Thinking about it later, I realized a more naive approach would be to simply copy all the rows, which turns out to be much faster. Then the filtering step is done with between, which is also fast. This version of the function is about 2x the speed of the previous version (but about 0.01x the speed of plotting the raw data).

Upvotes: 1

Related Questions