Reputation: 113

Write code to group_by/summarise in dplyr more concisely?

I have a dataframe, call it "df" that contains all detections of about 250 named individual birds during the last 5 years, about 11000 rows. df has columns DATE, BIRD, YEAR, MONTH, DAY, and OUTCOME. The group_by/summarise command creates a new table "df2" where there is one row for each individual BIRD and the new columns, one per month, contain a "1" if the bird was seen at all that month and a "0" if it was not detected. The columns are named in the format "YYMM" so March 2020 appears as the colum "2003" in the new table. Right now the instructions to make the table take up 60+ lines. I write one line for each new column (50 months means my command gets 50 lines) - see below. Sample data:

df <- data.frame(DATE = as.Date(c("02/16/18","03/16/18","03/16/18","04/16/18","05/16/18","05/19/18"),
                                format = "%m/%d/%y"),
                 BIRD = c("emww","emww","oaam","bbcm","bbcm","bbcm"),
                 YEAR = c(2018,2018,2018,2018,2018,2018),
                 MONTH = c(02,03,03,04,05,05),
                 OUTCOME = c(1,0,1,1,0,0))

Code that works but which becomes so lengthy:

df2 <- df %>% 
  group_by(BIRD) %>% 
  summarise(
    "1802" = as.numeric(any(YEAR==2018 & MONTH == 2 & OUTCOME==1)),
    "1803" = as.numeric(any(YEAR==2018 & MONTH == 3 & OUTCOME==1)),
    "1804" = as.numeric(any(YEAR==2018 & MONTH == 4 & OUTCOME==1)),
    "1805" = as.numeric(any(YEAR==2018 & MONTH == 5 & OUTCOME==1)),
    "1806" = as.numeric(any(YEAR==2018 & MONTH == 6 & OUTCOME==1)),
    "1807" = as.numeric(any(YEAR==2018 & MONTH == 7 & OUTCOME==1)),
    "1808" = as.numeric(any(YEAR==2018 & MONTH == 8 & OUTCOME==1)))

(for the five years of the study there would be 60 lines like the above, with my only editing the column title, the year and the month in lines that are otherwise identical).

I would love to be able to do something like

startdate <- as.Date("02/16/18", format = "%m/%d/%y")
enddate <- as.Date("12/16/23", format = "%m/%d/%y")

And then have R write out my big block of group_by/summarize code for those months instead of my editing it by hand. Anyone have a pointer for how to do this (or a different more efficient approach)?

Upvotes: 1

Answers (3)

jay.sf

Reputation: 73692

You can make it short and sweet with reshape2::dcast.

> df |> transform(DATE=strftime(DATE, '%y%m')) |> 
+   reshape2::dcast(BIRD ~ DATE, fun=sum, value.var='OUTCOME')
  BIRD 1802 1803 1804 1805
1 bbcm    0    0    1    0
2 emww    1    0    0    0
3 oaam    0    1    0    0

Using sum as function assuming summing up by month was intended.

Edit

According to your comments we can wrap it in sign to get binaries.

> df |> transform(DATE=strftime(DATE, '%y%m')) |> 
+   reshape2::dcast(BIRD ~ DATE, fun=\(x) sign(sum(x)), value.var='OUTCOME')
  BIRD 1802 1803 1804 1805
1 bbcm    0    0    1    0
2 emww    1    0    0    0
3 oaam    0    1    0    0

Data:

> dput(df)
structure(list(DATE = structure(c(17578, 17606, 17606, 17637, 
17667, 17670), class = "Date"), BIRD = c("emww", "emww", "oaam", 
"bbcm", "bbcm", "bbcm"), YEAR = c(2018, 2018, 2018, 2018, 2018, 
2018), MONTH = c(2, 3, 3, 4, 5, 5), OUTCOME = c(1, 0, 1, 1, 0, 
0)), class = "data.frame", row.names = c(NA, -6L))

Upvotes: 1

Nir Graham

Reputation: 5167

My solution involves your start and end date parameters to set up how much data to invent either side of the recorded data.

library(tidyverse)

df <- data.frame(
  DATE = as.Date(c("02/16/18", "03/16/18", "03/16/18", "04/16/18", "05/16/18", "05/19/18"),
    format = "%m/%d/%y"
  ),
  BIRD = c("emww", "emww", "oaam", "bbcm", "bbcm", "bbcm"),
  YEAR = c(2018, 2018, 2018, 2018, 2018, 2018),
  MONTH = c(02, 03, 03, 04, 05, 05),
  OUTCOME = c(1, 0, 1, 1, 0, 0)
)

startdate <- as.Date("02/16/18", format = "%m/%d/%y")
enddate <- as.Date("12/16/23", format = "%m/%d/%y")

# solution starts here 
# get the 1802 .... type month labels we ultimately want to produce
(datevec <- enframe(seq(startdate, enddate, by = "month")) |> mutate(
  ym = 100 * (year(value) %% 100) + month(value)
) |> pull(ym) |> unique())

# use tidyr::complete to populate absent ym dates.
(df2 <- df %>% mutate(ym = factor(100 * (YEAR %% 100) + MONTH,
  levels = datevec
)) |>
  select(-YEAR, -MONTH, -DATE) |> 
  summarise(seen = 1 * any(OUTCOME == 1),.by = c(BIRD,ym)) |>
  tidyr::complete(BIRD, ym, fill = list(seen = 0)) |>
  pivot_wider(
    names_from = "ym",
    values_from = "seen"
  )
)

Upvotes: 0

Jon Spring

Reputation: 66935

library(tidyverse)
df |>
  mutate(time = paste0(YEAR - 2000, str_pad(MONTH, width = 2, pad = "0"))) |>
  select(-YEAR, -MONTH) |>
  summarize(value = 1*any(OUTCOME == 1), .by = c(BIRD, time)) |>
  pivot_wider(names_from = time, values_from = value, values_fill = 0)

Result

# A tibble: 3 × 5
  BIRD  `1802` `1803` `1804` `1805`
  <chr>  <dbl>  <dbl>  <dbl>  <dbl>
1 emww       1      0      0      0
2 oaam       0      1      0      0
3 bbcm       0      0      1      0

Upvotes: 1

Write code to group_by/summarise in dplyr more concisely?

Answers (3)

Edit

Related Questions