Danny

Reputation: 488

dplyr: keep empty levels of factor but not empty levels of a combination of factors that don't appear in data

When grouping and summarising with dplyr, what is the correct way to keep empty levels of each grouping factor but not keep empty combinations from multiple grouping factors?

As an example, consider data recorded at different times at multiple sites. I might filter the data and then calculate something for each year at each site. If the filter removes a year completely, I'd like that year to get the value the summary returns for an empty vector. Site "a" has 10 years of data and site "b" has 1 year, so I'd always like 11 rows in the summary.

If I use .drop = TRUE in group_by I lose years:

library(dplyr)
library(zoo)
library(lubridate)

set.seed(1)

df <- data.frame(site = factor(c(rep("a", 120), rep("b", 12))),
                 date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120), seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
                 value = rnorm(132, 50, 10))
df$year <- factor(lubridate::year(df$date))

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = TRUE) %>%
  summarise(f = first(date))
#> # A tibble: 6 x 3
#> # Groups:   site [1]
#>   site  year  f         
#>   <fct> <fct> <date>    
#> 1 a     2000  2000-04-01
#> 2 a     2004  2004-08-01
#> 3 a     2005  2005-01-01
#> 4 a     2007  2007-11-01
#> 5 a     2008  2008-10-01
#> 6 a     2009  2009-02-01

and with .drop = FALSE I gain all the extra years for site "b" which were not in the original data:

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = FALSE) %>%
  summarise(f = first(date))
#> # A tibble: 20 x 3
#> # Groups:   site [2]
#>    site  year  f         
#>    <fct> <fct> <date>    
#>  1 a     2000  2000-04-01
#>  2 a     2001  NA        
#>  3 a     2002  NA        
#>  4 a     2003  NA        
#>  5 a     2004  2004-08-01
#>  6 a     2005  2005-01-01
#>  7 a     2006  NA        
#>  8 a     2007  2007-11-01
#>  9 a     2008  2008-10-01
#> 10 a     2009  2009-02-01
#> 11 b     2000  NA        
#> 12 b     2001  NA        
#> 13 b     2002  NA        
#> 14 b     2003  NA        
#> 15 b     2004  NA        
#> 16 b     2005  NA        
#> 17 b     2006  NA        
#> 18 b     2007  NA        
#> 19 b     2008  NA        
#> 20 b     2009  NA

The best way I could think of was to calculate counts on the unfiltered data, join them on, filter out the combinations with a count of zero, and then drop the count variable, but that's pretty messy. I know the .drop argument was only recently added to dplyr and it is very useful for one factor, but is there a clean way to do this for multiple factors yet?

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = FALSE) %>%
  summarise(f = first(date)) %>%
  left_join(df %>% count(site, year, .drop = FALSE), by = c("site", "year")) %>%
  filter(n > 0) %>%
  select(-n)
#> # A tibble: 11 x 3
#> # Groups:   site [2]
#>    site  year  f         
#>    <fct> <fct> <date>    
#>  1 a     2000  2000-04-01
#>  2 a     2001  NA        
#>  3 a     2002  NA        
#>  4 a     2003  NA        
#>  5 a     2004  2004-08-01
#>  6 a     2005  2005-01-01
#>  7 a     2006  NA        
#>  8 a     2007  2007-11-01
#>  9 a     2008  2008-10-01
#> 10 a     2009  2009-02-01
#> 11 b     2000  NA

Upvotes: 1

Views: 1900

Answers (1)

Humpelstielzchen

Reputation: 6441

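Not sure if this is what you're after.

If you set the dates with value <= 65 to NA instead of filtering those rows out, every site/year combination present in the data stays in the grouping and you can summarise as usual.

df %>% 
  # blank out the dates that filter(value > 65) would have removed
  mutate(date = replace(date, value <= 65, NA)) %>%
  group_by(site, year) %>%
  # first remaining date per group; NA if none are left
  summarise(f = first(date[!is.na(date)]))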

# A tibble: 11 x 3
# Groups:   site [2]
   site  year  f         
   <fct> <fct> <date>    
 1 a     2000  NA        
 2 a     2001  NA        
 3 a     2002  2002-03-01
 4 a     2003  NA        
 5 a     2004  NA        
 6 a     2005  NA        
 7 a     2006  2006-02-01
 8 a     2007  NA        
 9 a     2008  2008-07-01
10 a     2009  2009-02-01
11 b     2000  2000-08-01
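
A variant of the same idea, as a minimal sketch: skip the mutate() step and apply the condition by subsetting inside summarise(). It assumes the data frame from the question (rebuilt here so the snippet runs on its own, with base format() standing in for lubridate::year()) and dplyr 1.0.0 or later for the .groups argument. The name res is just an illustrative choice.

```r
library(dplyr)

set.seed(1)

# Same shape as the question's data: 120 months at site "a", 12 at site "b"
df <- data.frame(
  site  = factor(c(rep("a", 120), rep("b", 12))),
  date  = c(seq(as.Date("2000-01-01"), by = "month", length.out = 120),
            seq(as.Date("2000-01-01"), by = "month", length.out = 12)),
  value = rnorm(132, 50, 10)
)
df$year <- factor(format(df$date, "%Y"))

# Group the unfiltered data, so only site/year pairs that occur in df
# become groups, and apply the condition inside summarise() rather than
# in a separate filter() step
res <- df %>%
  group_by(site, year) %>%
  summarise(f = first(date[value > 65]), .groups = "drop")

nrow(res)  # 11: ten years for site "a", one for site "b"
```

Because the summary function really is called on an empty vector when no rows pass the condition, other summaries should give their own empty-vector result too, for example sum(value[value > 65]) returns 0 for those groups rather than NA.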

Upvotes: 1
