summarize original data based on expand grid categories

Question

I would like to summarize a table using dplyr and tidyr. I have a data.frame like this:

 year    region week  site           species    gps_clutch
2017    sud   18     6                  au        337
2017    sud   20     10                 au        352
2017    sud   22     10                 au        352
2017    sud   24     10                 au        352
2017    sud   18     6                  aio       337
2017    sud   20     6                  aio       352
2017    sud   22     6                  au        352
2018    sud   20     6                  au        337
2018    sud   20     10                 au        352
2018    sud   22     10                 au        352
2018    sud   22     10                 aio       352
2018    sud   22     6                  au        352
2017    nor   19     5                  au        337
2017    nor   21     2                  au        352
2017    nor   23     5                  au        352
2017    nor   25     2                  au        352
2017    nor   19     5                  aio       337
2017    nor   25     5                  aio       352
2017    nor   19     5                  au        337
2018    nor   21     2                  aio       352
2018    nor   23     5                  aio       352
2018    nor   25     2                  au        352
2018    nor   23     5                  aio       337
2018    nor   23     5                  au        352

I would like to count the number of "gps_clutch" records for each combination of year, region, site, species and week, and expand the result to all the weeks sampled in each region. To explain: in region "sud" I sampled weeks 18, 20, 22 and 24, and in region "nor" weeks 19, 21, 23 and 25. I would like to turn implicit missing values into explicit "0" counts, but only for the weeks (nested within regions) that were actually sampled. I do not want a row for week 19 in region "sud", for example, because that region was not sampled in that week.

This code expands the grid the way I want:

dat %>%
  group_by(region) %>%
  expand(year,site, species,week)

The following code also works to get the counts, but it does not expand the grid as I wish: I only get the weeks in which I observed something in each year, not all the weeks sampled across both years. This means that if in "sud" in 2017 I only have records for weeks 20 and 22, the grid is not expanded to weeks 18 and 24:

dat %>%
  group_by(year, region, site, species, week) %>%
  summarise(count_clutch = length(gps_clutch)) %>%
  complete(week, nesting(year, site, species), fill = list(count_clutch = 0))

This is the table I would like to get at the end:

 year    region week  site           species    count
2017     sud    18     6             au         1
2017     sud    20     6             au         0
2017     sud    22     6             au         1
2017     sud    24     6             au         0

2017     sud    18     6             aio        1
2017     sud    20     6             aio        1
2017     sud    22     6             aio        0
2017     sud    24     6             aio        0

2017     sud    18     10            au         0
2017     sud    20     10            au         1
2017     sud    22     10            au         1
2017     sud    24     10            au         1

2017     sud    18     10            aio        0
2017     sud    20     10            aio        0
2017     sud    22     10            aio        0
2017     sud    24     10            aio        0

2018     sud    18     6             au        0
2018     sud    20     6             au        1
2018     sud    22     6             au        1
2018     sud    24     6             au        0

2018     sud    18     6             aio       0
2018     sud    20     6             aio       0
2018     sud    22     6             aio       0 
2018     sud    24     6             aio       0

2018     sud    18     10            au        0
2018     sud    20     10            au        1
2018     sud    22     10            au        1
2018     sud    24     10            au        0

2018     sud    18     10            aio       0
2018     sud    20     10            aio       0
2018     sud    22     10            aio       1
2018     sud    24     10            aio       0

and so on for region "nor"...

Any suggestions for combining these two pieces of code would be appreciated :)

aosmith · Accepted Answer

You are so close with your two approaches. Essentially they just need to be combined to get what you're after. :)

Group by region and complete() the dataset first, then regroup by all the variables and summarise(). Since gps_clutch will now be NA in every row that complete() added, you can count the clutches by summing the non-missing values (via !is.na) in the summarise() statement.

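library(dplyr)
library(tidyr)

dat %>%
    group_by(region) %>%                               # expand within each region only
    complete(year, site, species, week) %>%            # added rows get gps_clutch = NA
    group_by(year, region, site, species, week) %>%
    summarise(count_clutch = sum(!is.na(gps_clutch)))  # count the non-missing clutches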

# A tibble: 64 x 6
# Groups:   year, region, site, species [16]
    year region  site species  week count_clutch
                  
 1  2017 nor        2 aio        19            0
 2  2017 nor        2 aio        21            0
 3  2017 nor        2 aio        23            0
 4  2017 nor        2 aio        25            0
 5  2017 nor        2 au         19            0
 6  2017 nor        2 au         21            1
 7  2017 nor        2 au         23            0
 8  2017 nor        2 au         25            1
 9  2017 nor        5 aio        19            1
10  2017 nor        5 aio        21            0
# ... with 54 more rows
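For anyone who wants to run this end to end, here is a minimal reproducible sketch. It assumes the question's table is typed in with read.table() under the name dat, and that dplyr is recent enough to support the .groups argument of summarise() (1.0.0 or later). The names res and res2 are only illustrative. The second pipeline is a variation of mine rather than part of the approach above: it counts first with count() and then fills the combinations added by complete() with 0.

```r
library(dplyr)
library(tidyr)

# The question's table, typed in as text (assumed object name: dat)
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
year region week site species gps_clutch
2017 sud 18  6 au  337
2017 sud 20 10 au  352
2017 sud 22 10 au  352
2017 sud 24 10 au  352
2017 sud 18  6 aio 337
2017 sud 20  6 aio 352
2017 sud 22  6 au  352
2018 sud 20  6 au  337
2018 sud 20 10 au  352
2018 sud 22 10 au  352
2018 sud 22 10 aio 352
2018 sud 22  6 au  352
2017 nor 19  5 au  337
2017 nor 21  2 au  352
2017 nor 23  5 au  352
2017 nor 25  2 au  352
2017 nor 19  5 aio 337
2017 nor 25  5 aio 352
2017 nor 19  5 au  337
2018 nor 21  2 aio 352
2018 nor 23  5 aio 352
2018 nor 25  2 au  352
2018 nor 23  5 aio 337
2018 nor 23  5 au  352
")

# Complete within region first, then count the non-missing clutches
res <- dat %>%
  group_by(region) %>%
  complete(year, site, species, week) %>%
  group_by(year, region, site, species, week) %>%
  summarise(count_clutch = sum(!is.na(gps_clutch)), .groups = "drop")

# Variation: count first, then complete within region and fill the gaps with 0
res2 <- dat %>%
  count(year, region, site, species, week, name = "count_clutch") %>%
  group_by(region) %>%
  complete(year, site, species, week, fill = list(count_clutch = 0L)) %>%
  ungroup()

nrow(res)  # 64: 2 regions x 2 years x 2 sites x 2 species x 4 weeks
```

Both versions only create rows for weeks that were sampled in the same region, so there is no week 19 row for "sud". Using .groups = "drop" returns an ungrouped tibble, which is usually more convenient if the result is passed on to further steps.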
