Alex C.
Alex C.

Reputation: 75

R: Special case of merging two datasets by date

I'm using R to analyze data from an animal shelter. I have a dataset with animal intakes, and another dataset showing outcomes. I would like to combine those datasets, so that each animal's intake information and corresponding outcome information are on the same row.

Each outcome has a corresponding earlier intake. Some intakes have no outcomes, because those animals are still in the system. A single animal can cycle through the system multiple times (e.g., the animal is surrendered to the shelter, adopted-out, returned to the shelter, adopted-out again, etc.)

The dataframes look something like this:

Intakes:

 Animal.ID Intake.Date Intake.Type
        A1  2016-01-01   Surrender
        A2  2017-01-01       Stray
        A1  2018-01-01   Surrender
        A3  2019-01-01       Stray
        A4  2020-01-01      Seized
        A5  2021-01-01   Surrender

Outcomes:

 Animal.ID Outcome.Date Outcome.Type
        A1   2016-06-30     Adoption
        A2   2017-06-30   Euthanasia
        A1   2018-06-30     Transfer
        A3   2019-06-30     Adoption
        A5   2021-06-30     Transfer

In my example, the animal with Animal.ID "A1" cycled through the system twice. The animal with Animal.ID "A4" has no outcome record, because the animal is still in the shelter's care.

How do I combine (merge) the datasets so that the resulting dataset looks like this?

Combined:

 Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
        A1  2016-01-01   Surrender   2016-06-30     Adoption
        A2  2017-01-01       Stray   2017-06-30   Euthanasia
        A1  2018-01-01   Surrender   2018-06-30     Transfer
        A3  2019-01-01       Stray   2019-06-30     Adoption
        A4  2020-01-01      Seized         <NA>         <NA>
        A5  2021-01-01   Surrender   2021-06-30     Transfer

I imagine that this type of problem has been solved in the past, but I must not be using the right terms when I run Google searches.

EDIT: The actual data contains date/times (not just dates). An outcome can occur within a few minutes of an intake, or up to many months later.

Here is the code that creates those sample datasets:

intakes <- data.frame(
  Animal.ID = c("A1","A2","A1","A3","A4","A5"),
  Intake.Date = as.Date(c("2016-01-01","2017-01-01","2018-01-01","2019-01-01","2020-01-01","2021-01-01")),
  Intake.Type = c("Surrender","Stray","Surrender","Stray","Seized","Surrender")
)

outcomes <- data.frame(
  Animal.ID = c("A1","A2","A1","A3","A5"),
  Outcome.Date = as.Date(c("2016-06-30","2017-06-30","2018-06-30","2019-06-30","2021-06-30")),
  Outcome.Type = c("Adoption","Euthanasia","Transfer","Adoption","Transfer")
)

Upvotes: 0

Views: 78

Answers (2)

walter
walter

Reputation: 528

The previous answer is along the right lines, but rather than sorting by date, I assume that intakes are associated with exactly one outcome, except the most recent intake for a given animal, which is associated with zero or one outcomes.

Hence, I suggest creating a new variable in each dataset, which is the (unique) number of the occurrence for each animal (Animal.ID.occurrence), and use this when joining.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

intakes <- data.frame(
    Animal.ID = c("A1","A2","A1","A3","A4","A5"),
    Intake.Date = as.Date(c("2016-01-01","2017-01-01","2018-01-01","2019-01-01","2020-01-01","2021-01-01")),
    Intake.Type = c("Surrender","Stray","Surrender","Stray","Seized","Surrender")
)

outcomes <- data.frame(
    Animal.ID = c("A1","A2","A1","A3","A5"),
    Outcome.Date = as.Date(c("2016-06-30","2017-06-30","2018-06-30","2019-06-30","2021-06-30")),
    Outcome.Type = c("Adoption","Euthanasia","Transfer","Adoption","Transfer")
)

intakes_occurrence <- intakes %>% group_by(Animal.ID) %>%
    arrange(Intake.Date) %>% 
    mutate(Animal.ID.occurrence = paste0(Animal.ID, ".", row_number())) %>% 
    ungroup()
outcomes_occurrence <- outcomes %>% group_by(Animal.ID) %>%
    arrange(Outcome.Date) %>% 
    mutate(Animal.ID.occurrence = paste0(Animal.ID, ".", row_number())) %>% 
    ungroup() %>% 
    select(-Animal.ID) 

intakes_occurrence %>% 
    full_join(outcomes_occurrence, by="Animal.ID.occurrence") %>% 
    select(-Animal.ID.occurrence)
#> # A tibble: 6 × 5
#>   Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
#>   <chr>     <date>      <chr>       <date>       <chr>       
#> 1 A1        2016-01-01  Surrender   2016-06-30   Adoption    
#> 2 A2        2017-01-01  Stray       2017-06-30   Euthanasia  
#> 3 A1        2018-01-01  Surrender   2018-06-30   Transfer    
#> 4 A3        2019-01-01  Stray       2019-06-30   Adoption    
#> 5 A4        2020-01-01  Seized      NA           <NA>        
#> 6 A5        2021-01-01  Surrender   2021-06-30   Transfer

Created on 2021-09-05 by the reprex package (v2.0.1)

Edited to sort by date.

Upvotes: 1

Brigadeiro
Brigadeiro

Reputation: 2945

You need to create a new variable to match upon because there is not enough data in the datasets as-is to determine a unique match for each row. It looks like you want to match by Animal.ID and then the intake year, so I created a new variable year, matched on both, then removed this from the final dataset. You can, of course, make this new variable whatever you want to incorporate more complex situations (e.g., Intake on 12-31-2020 with outcome 01-01-2021).

library(dplyr)
library(lubridate)

intakes %>%
  mutate(year = year(Intake.Date)) %>%
  left_join(mutate(outcomes, year = year(Outcome.Date)), by = c("Animal.ID", "year")) %>%
  select(-year)
  Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
1        A1  2016-01-01   Surrender   2016-06-30     Adoption
2        A2  2017-01-01       Stray   2017-06-30   Euthanasia
3        A1  2018-01-01   Surrender   2018-06-30     Transfer
4        A3  2019-01-01       Stray   2019-06-30     Adoption
5        A4  2020-01-01      Seized         <NA>         <NA>
6        A5  2021-01-01   Surrender   2021-06-30     Transfer

Upvotes: 1

Related Questions