user3016665
user3016665

Reputation: 149

summarize original data based on expand grid categories

I would like to summarize a table using dplyr. Here is how I would like to proceed:

 year    region week  site           species    gps_clutch
2017    sud   18     6                  au        337
2017    sud   20     10                 au        352
2017    sud   22     10                 au        352
2017    sud   24     10                 au        352
2017    sud   18     6                  aio       337
2017    sud   20     6                  aio       352
2017    sud   22     6                  au        352
2018    sud   20     6                  au        337
2018    sud   20     10                 au        352
2018    sud   22     10                 au        352
2018    sud   22     10                 aio       352
2018    sud   22     6                  au        352
2017    nor   19     5                  au        337
2017    nor   21     2                  au        352
2017    nor   23     5                  au        352
2017    nor   25     2                  au        352
2017    nor   19     5                  aio       337
2017    nor   25     5                  aio       352
2017    nor   19     5                  au        337
2018    nor   21     2                  aio       352
2018    nor   23     5                  aio        352
2018    nor   25     2                  au        352
2018    nor   23     5                  aio       337
2018    nor   23     5                  au       352

this code works well to expand the grid as I would like:

dat %>%
  group_by(region) %>%
  expand(year,site, species,week)

the following code works too, to get the count values but does not expand the grid as I wish (I only get the list of weeks for which I did observe something for each year, not the total number of weeks sampled across both years). Which mean that if in "sud" "2017" I only have records for weeks 20 and 22, the grid will not get expanded to week 18 and 24 :

field_subsetnord %>%
  group_by(year,region,site,species,week) %>%
  summarise(count_clutch=length(gps_clutch)) %>% 
  complete(week,nesting(year,sites,species), fill = list(count_clutch = 0))

this is the table I would like to get at the end:

 year    region week  site           species    count
2017     sud    18     6             au         1
2017     sud    20     6             au         0
2017     sud    22     6             au         1
2017     sud    24     6             au         0

2017     sud    18     6             aio        1
2017     sud    20     6             aio        1
2017     sud    22     6             aio        0
2017     sud    24     6             aio        0

2017     sud    18     10            au         0
2017     sud    20     10            au         1
2017     sud    22     10            au         1
2017     sud    24     10            au         1

2017     sud    18     10            aio        0
2017     sud    20     10            aio        0
2017     sud    22     10            aio        0
2017     sud    24     10            aio        0

2018     sud    18     6             au        0
2018     sud    20     6             au        1
2018     sud    22     6             au        1
2018     sud    24     6             au        0

2018     sud    18     6             aio       0
2018     sud    20     6             aio       0
2018     sud    22     6             aio       0 
2018     sud    24     6             aio       0

2018     sud    18     10            au        0
2018     sud    20     10            au        1
2018     sud    22     10            au        1
2018     sud    24     10            au        0

2018     sud    18     10            aio       0
2018     sud    20     10            aio       0
2018     sud    22     10            aio       1
2018     sud    24     10            aio       0

and so on for 2018...

any suggestions to mix these two codes would be appreciated :)

Upvotes: 0

Views: 266

Answers (1)

aosmith
aosmith

Reputation: 36086

You are so close with your two approaches. Essentially they just need to be combined to get what you're after. :)

Group by region and then complete() the dataset first, then regroup by all variables and summarise(). Since the gps_clutch will now have missing values in it, you can sum up the non-missing values (via !is.na) in the summarise() statement to count the clutches.

dat %>%
    group_by(region) %>%
    complete(year, site, species, week) %>% 
    group_by(year, region, site, species, week) %>%
    summarise(count_clutch = sum( !is.na(gps_clutch) ) )

# A tibble: 64 x 6
# Groups:   year, region, site, species [16]
    year region  site species  week count_clutch
   <int> <fct>  <int> <fct>   <int>        <int>
 1  2017 nor        2 aio        19            0
 2  2017 nor        2 aio        21            0
 3  2017 nor        2 aio        23            0
 4  2017 nor        2 aio        25            0
 5  2017 nor        2 au         19            0
 6  2017 nor        2 au         21            1
 7  2017 nor        2 au         23            0
 8  2017 nor        2 au         25            1
 9  2017 nor        5 aio        19            1
10  2017 nor        5 aio        21            0
# ... with 54 more rows

Upvotes: 0

Related Questions