Reputation: 75
I'm using R to analyze data from an animal shelter. I have a dataset with animal intakes, and another dataset showing outcomes. I would like to combine those datasets, so that each animal's intake information and corresponding outcome information are on the same row.
Each outcome has a corresponding earlier intake. Some intakes have no outcomes, because those animals are still in the system. A single animal can cycle through the system multiple times (e.g., the animal is surrendered to the shelter, adopted-out, returned to the shelter, adopted-out again, etc.)
The dataframes look something like this:
Intakes:
Animal.ID Intake.Date Intake.Type
A1 2016-01-01 Surrender
A2 2017-01-01 Stray
A1 2018-01-01 Surrender
A3 2019-01-01 Stray
A4 2020-01-01 Seized
A5 2021-01-01 Surrender
Outcomes:
Animal.ID Outcome.Date Outcome.Type
A1 2016-06-30 Adoption
A2 2017-06-30 Euthanasia
A1 2018-06-30 Transfer
A3 2019-06-30 Adoption
A5 2021-06-30 Transfer
In my example, the animal with Animal.ID "A1" cycled through the system twice. The animal with Animal.ID "A4" has no outcome record, because the animal is still in the shelter's care.
How do I combine (merge) the datasets so that the resulting dataset looks like this?
Combined:
Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
A1 2016-01-01 Surrender 2016-06-30 Adoption
A2 2017-01-01 Stray 2017-06-30 Euthanasia
A1 2018-01-01 Surrender 2018-06-30 Transfer
A3 2019-01-01 Stray 2019-06-30 Adoption
A4 2020-01-01 Seized <NA> <NA>
A5 2021-01-01 Surrender 2021-06-30 Transfer
I imagine that this type of problem has been solved in the past, but I must not be using the right terms when I run Google searches.
EDIT: The actual data contains date/times (not just dates). An outcome can occur within a few minutes of an intake, or up to many months later.
Here is the code that creates those sample datasets:
intakes <- data.frame(
Animal.ID = c("A1","A2","A1","A3","A4","A5"),
Intake.Date = as.Date(c("2016-01-01","2017-01-01","2018-01-01","2019-01-01","2020-01-01","2021-01-01")),
Intake.Type = c("Surrender","Stray","Surrender","Stray","Seized","Surrender")
)
outcomes <- data.frame(
Animal.ID = c("A1","A2","A1","A3","A5"),
Outcome.Date = as.Date(c("2016-06-30","2017-06-30","2018-06-30","2019-06-30","2021-06-30")),
Outcome.Type = c("Adoption","Euthanasia","Transfer","Adoption","Transfer")
)
Upvotes: 0
Views: 78
Reputation: 528
The previous answer is along the right lines, but rather than sorting by date, I assume that intakes are associated with exactly one outcome, except the most recent intake for a given animal, which is associated with zero or one outcomes.
Hence, I suggest creating a new variable in each dataset, which is the (unique) number of the occurrence for each animal (Animal.ID.occurrence
), and use this when joining.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
intakes <- data.frame(
Animal.ID = c("A1","A2","A1","A3","A4","A5"),
Intake.Date = as.Date(c("2016-01-01","2017-01-01","2018-01-01","2019-01-01","2020-01-01","2021-01-01")),
Intake.Type = c("Surrender","Stray","Surrender","Stray","Seized","Surrender")
)
outcomes <- data.frame(
Animal.ID = c("A1","A2","A1","A3","A5"),
Outcome.Date = as.Date(c("2016-06-30","2017-06-30","2018-06-30","2019-06-30","2021-06-30")),
Outcome.Type = c("Adoption","Euthanasia","Transfer","Adoption","Transfer")
)
intakes_occurrence <- intakes %>% group_by(Animal.ID) %>%
arrange(Intake.Date) %>%
mutate(Animal.ID.occurrence = paste0(Animal.ID, ".", row_number())) %>%
ungroup()
outcomes_occurrence <- outcomes %>% group_by(Animal.ID) %>%
arrange(Outcome.Date) %>%
mutate(Animal.ID.occurrence = paste0(Animal.ID, ".", row_number())) %>%
ungroup() %>%
select(-Animal.ID)
intakes_occurrence %>%
full_join(outcomes_occurrence, by="Animal.ID.occurrence") %>%
select(-Animal.ID.occurrence)
#> # A tibble: 6 × 5
#> Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
#> <chr> <date> <chr> <date> <chr>
#> 1 A1 2016-01-01 Surrender 2016-06-30 Adoption
#> 2 A2 2017-01-01 Stray 2017-06-30 Euthanasia
#> 3 A1 2018-01-01 Surrender 2018-06-30 Transfer
#> 4 A3 2019-01-01 Stray 2019-06-30 Adoption
#> 5 A4 2020-01-01 Seized NA <NA>
#> 6 A5 2021-01-01 Surrender 2021-06-30 Transfer
Created on 2021-09-05 by the reprex package (v2.0.1)
Edited to sort by date.
Upvotes: 1
Reputation: 2945
You need to create a new variable to match upon because there is not enough data in the datasets as-is to determine a unique match for each row. It looks like you want to match by Animal.ID
and then the intake year, so I created a new variable year
, matched on both, then removed this from the final dataset. You can, of course, make this new variable whatever you want to incorporate more complex situations (e.g., Intake on 12-31-2020 with outcome 01-01-2021).
library(dplyr)
library(lubridate)
intakes %>%
mutate(year = year(Intake.Date)) %>%
left_join(mutate(outcomes, year = year(Outcome.Date)), by = c("Animal.ID", "year")) %>%
select(-year)
Animal.ID Intake.Date Intake.Type Outcome.Date Outcome.Type
1 A1 2016-01-01 Surrender 2016-06-30 Adoption
2 A2 2017-01-01 Stray 2017-06-30 Euthanasia
3 A1 2018-01-01 Surrender 2018-06-30 Transfer
4 A3 2019-01-01 Stray 2019-06-30 Adoption
5 A4 2020-01-01 Seized <NA> <NA>
6 A5 2021-01-01 Surrender 2021-06-30 Transfer
Upvotes: 1