Reputation: 51

Merging two data frames by time range in R

I am working with cattle fertility data. In one table (data frame), what I have is a record of all the services performed in a cow (like inseminations). In a different table, I get the pregnancy diagnosis (positive or negative). Both have an unique ID (animal_id). My challenge has been successfully merging both tables in the right data range, meaning what I need is the pregnancy check associated with the right insemination record. Here is a sample of how both tables look like,

animal_id     service_date
610710        2005-10-22
610710        2006-12-03
610710        2006-12-27
610710        2007-12-02
610710        2008-01-17
610710        2008-03-04

The other table is the same but with a different date (event_date) and the diagnosis,

 animal_id     event_date        event_description
    610710     2006-06-16           PP
    610710     2007-02-15           PP
    610710     2008-01-09           PN
    610710     2008-04-09           PP
    610710     2009-06-16           PP

So what I would like to do is merge both tables in a way the dates complement each other, meaning if a service was performed on 2005-10-12, when I join both tables this row will link to the closest date in the Events table, and by closest I also mean later - since insemination happens before diagnosis. So the desired output would be something like this,

    animal_id    service_date       event_date     event_description
 1   610710       2005-10-22              NA               NA
 2   610710           NA              2006-06-16           PP
 3   610710       2006-12-03          2007-02-15           PP
 4   610710       2006-12-27          2007-02-15           PP
 5   610710       2007-12-02          2008-01-09           PN
 6   610710       2008-01-17          2008-04-09           PP
 7   610710       2008-03-04              NA               NA  
 8   610710           NA              2009-06-16           PP

In the final output, I would expect a large number of records not to merge against anything, like row 1 in the example output. There was a service performed in October 2005, but the first Diagnosis I have for that cow is in June 2006 - there are probably a number of service records missing. That is unfortunately to be expected. For this example, only rows 5 and 6 make sense. For rows 3 and 4, I would consider only row 4, since that is probably the insemination that resulted into pregnancy.

Is that even possible in R?

Thank you!

Upvotes: 2

Answers (2)

r2evans

Reputation: 160817

What you're asking for is a "non-equi" or "range" join. ~~This isn't supported by base R (or dplyr, lacking dbplyr), but can be done with some other packages.~~

EDIT: recent versions of dplyr now support range-based joins.

For all, I create event_date_lag so that we limit the amount of returns for each row. (Without it, we'd get multiple matches.)

fuzzyjoin

out <- fuzzyjoin::fuzzy_full_join(
  services, events,
  by = c("animal_id" = "animal_id",
         "service_date" = "event_date_lag",
         "service_date" = "event_date"),
  match_fun = list(`==`, `>=`, `<=`))
# not sure why fuzzyjoin is splitting animal_id
out <- transform(out, animal_id = ifelse(is.na(animal_id.x), animal_id.y, animal_id.x))
out$animal_id.x <- out$animal_id.y <- out$event_date_lag <- NULL
# ordering here primarily to compare with your desired output
out[with(out, order(ifelse(is.na(service_date), event_date, service_date))),]
#   service_date event_date event_description animal_id
# 6   2005-10-22       <NA>              <NA>    610710
# 7         <NA> 2006-06-16                PP    610710
# 1   2006-12-03 2007-02-15                PP    610710
# 2   2006-12-27 2007-02-15                PP    610710
# 3   2007-12-02 2008-01-09                PN    610710
# 4   2008-01-17 2008-04-09                PP    610710
# 5   2008-03-04 2008-04-09                PP    610710
# 8         <NA> 2009-06-16                PP    610710

sqldf

SQL in general supports the concept of non-equi or range joins. There's nothing special about the sqldf package, just that it provides a native SQL experience (via RSQLite) without the overhead or hassle of uploading your data to a SQL DBMS and pulling it back down in this query. While that is in fact what is happening with sqldf, it automates much of it, allowing one to work directly on R objects using SQL.

If by chance you are already getting your data from a DBMS, then a SQL join is by far the most efficient: get it joined at the source.

sqldf::sqldf(
  "select svc.animal_id, svc.service_date,
     ev.event_date, ev.event_description
   from services svc
     left join events ev on svc.animal_id=ev.animal_id
       and svc.service_date between ev.event_date_lag and ev.event_date
   order by svc.service_date, ev.event_date")
#   animal_id service_date event_date event_description
# 1    610710   2005-10-22       <NA>              <NA>
# 2    610710   2006-12-03 2007-02-15                PP
# 3    610710   2006-12-27 2007-02-15                PP
# 4    610710   2007-12-02 2008-01-09                PN
# 5    610710   2008-01-17 2008-04-09                PP
# 6    610710   2008-03-04 2008-04-09                PP

data.table

While I use this often, if you aren't already using it, then it might be a little more than you need (its learning curve, though worth it, can be steep).

Notes:

the data.table-semantics (Y[X], which is effectively "X left join Y") supports inner, left, and right, but not full, semi, or anti-joins. While it might be possible using a cross-join (cartesian product), that explodes memory use and is (imo) not the best way to go.
the join tends to rename the left side (the X in Y[X]) variables to that on the right. This can be confusing, and it can in fact mask the actual pre-merge values, so I'll duplicate service_date to keep it separate.
I'm using as.data.table here just for the SO answer, not because it's required to distinguish between data.frame and data.table variables. If you're switching to data.table, then setDT is the canonical way to go.
If you choose this but do not continue with other data.table operations, then make sure you convert back to normal data.frame using setDF or as.data.frame; there are enough subtle differences that not doing this will be a problem.

library(data.table)
svcDT <- as.data.table(services)
evDT <- as.data.table(events)
evDT[svcDT[,sdate:=service_date],
     on = .(animal_id == animal_id, event_date_lag <= sdate, event_date >= sdate)
     ][, event_date_lag := NULL ]
#    animal_id event_date event_description service_date
# 1:    610710 2005-10-22              <NA>   2005-10-22
# 2:    610710 2006-12-03                PP   2006-12-03
# 3:    610710 2006-12-27                PP   2006-12-27
# 4:    610710 2007-12-02                PN   2007-12-02
# 5:    610710 2008-01-17                PP   2008-01-17
# 6:    610710 2008-03-04                PP   2008-03-04

dplyr

(A recent addition.)

library(dplyr)
services %>%
  left_join(events, join_by(animal_id, service_date >= event_date_lag, 
                            service_date <= event_date))
#   animal_id service_date event_date event_description event_date_lag
# 1    610710   2005-10-22       <NA>              <NA>           <NA>
# 2    610710   2006-12-03 2007-02-15                PP     2006-06-16
# 3    610710   2006-12-27 2007-02-15                PP     2006-06-16
# 4    610710   2007-12-02 2008-01-09                PN     2007-02-15
# 5    610710   2008-01-17 2008-04-09                PP     2008-01-09
# 6    610710   2008-03-04 2008-04-09                PP     2008-01-09

Data

services <- read.table(header = TRUE, text = "
animal_id     service_date
610710        2005-10-22
610710        2006-12-03
610710        2006-12-27
610710        2007-12-02
610710        2008-01-17
610710        2008-03-04")
services$service_date <- as.Date(services$service_date)

events <- read.table(header = TRUE, text = "
 animal_id     event_date        event_description
    610710     2006-06-16           PP
    610710     2007-02-15           PP
    610710     2008-01-09           PN
    610710     2008-04-09           PP
    610710     2009-06-16           PP")
events$event_date <- as.Date(events$event_date)
events$event_date_lag <- ave(events$event_date, events$animal_id, FUN=function(a) c(a[1][NA], a[-length(a)]))
events
#   animal_id event_date event_description event_date_lag
# 1    610710 2006-06-16                PP           <NA>
# 2    610710 2007-02-15                PP     2006-06-16
# 3    610710 2008-01-09                PN     2007-02-15
# 4    610710 2008-04-09                PP     2008-01-09
# 5    610710 2009-06-16                PP     2008-04-09

Upvotes: 2

G. Grothendieck

Reputation: 270195

Using the input shown reproducibly in the Note at the end bind them together using rbind_rows and then sort them by date using arrange. Then define the logical column collapse which is TRUE if the current row has a service_date and the next row has an event_date and they are less than or equal to 90 days apart -- change 90 to whatever you want. Then group by animal_id and a group number which increases by 1 each time a service_date is encountered and further group by rows except if the current row has collapse equal to TRUE then place it in the same group as the next row in order that it be matched to that next row's event_date. Finally summarize the groups and remove the temporary columns.

Note that this approach maintains the event rows that do not have corresponding service dates and also ensures that each event date is not matched to more than one service date.

library(dplyr)

bind_rows(DF1, DF2) %>%
  arrange(coalesce(service_date, event_date)) %>%
  group_by(animal_id, group = cumsum(!is.na(service_date))) %>%
  mutate(collapse = !is.na(service_date) & !is.na(lead(event_date)) & 
       lead(event_date) - service_date <= 90) %>%
  group_by(n = 1:n() + collapse, .add = TRUE) %>%
  summarize(animal_id = first(animal_id), 
    service_date = first(service_date), 
    event_date = last(event_date), 
    event_description = last(event_description), .groups = "drop") %>%
  select(-group, -n)

giving:

# A tibble: 8 x 4
  animal_id service_date event_date event_description
      <int> <date>       <date>     <chr>            
1    610710 2005-10-22   NA         <NA>             
2    610710 NA           2006-06-16 PP               
3    610710 2006-12-03   NA         <NA>             
4    610710 2006-12-27   2007-02-15 PP               
5    610710 2007-12-02   2008-01-09 PN               
6    610710 2008-01-17   NA         <NA>             
7    610710 2008-03-04   2008-04-09 PP               
8    610710 NA           2009-06-16 PP

sqldf

We can follow pretty much the same logic using the sqldf package:

library(sqldf)

sqldf("with b0 as 
        (select *, NULL event_date, NULL event_description from DF1 
        union 
        select animal_id, NULL service_date, event_date, event_description from DF2),
       b1 as (select *, coalesce(service_date, event_date) date1 
           from both order by animal_id, date1),
       b2 as (select *, lead(event_date) over () lead_event_date 
               from b1),
       b3 as (select *, coalesce(lead_event_date - service_date <= 90, 0) + 
                  row_number() over () coll 
                from b2)
      select distinct animal_id,
             group_concat(service_date) service_date, 
             group_concat(event_date) event_date, 
             group_concat(event_description) event_description 
        from b3 group by coll")

giving:

  animal_id service_date event_date event_description
1    610710   2005-10-22       <NA>              <NA>
2    610710         <NA> 2006-06-16                PP
3    610710   2006-12-03       <NA>              <NA>
4    610710   2006-12-27 2007-02-15                PP
5    610710   2007-12-02 2008-01-09                PN
6    610710   2008-01-17       <NA>              <NA>
7    610710   2008-03-04 2008-04-09                PP
8    610710         <NA> 2009-06-16                PP

Note

DF1 <- structure(list(animal_id = c(610710L, 610710L, 610710L, 610710L, 
610710L, 610710L), service_date = structure(c(13078, 13485, 13509, 
13849, 13895, 13942), class = "Date")), row.names = c(NA, -6L
), class = "data.frame")

DF2 <- structure(list(animal_id = c(610710L, 610710L, 610710L, 610710L, 
610710L), event_date = structure(c(13315, 13559, 13887, 13978, 
14411), class = "Date"), event_description = c("PP", "PP", "PN", 
"PP", "PP")), row.names = c(NA, -5L), class = "data.frame")

Upvotes: 0

Merging two data frames by time range in R

Answers (2)

fuzzyjoin

sqldf

data.table

dplyr

Data

sqldf

Note

Related Questions