Pierce Draisma
Pierce Draisma

Reputation: 3

Joining two tables based on non-matching date fields

There are two tables (A and B) that I want to join by date. This is complicated by the fact that the dates do not necessarily line up neatly across tables. That is to say, dates from table B might fall outside or between the entries in table A.

I suspect that there must be an easy way to accomplish this in SQL/SAS, but I’m very new to both and don’t see how. I would be very glad if someone could point me to concrete solutions, examples, or functions that I can use to solve this problem. I created a fictitious case below to illustrate.

This is an example of how table A might look like (for one participant):

Table A
-------------------------------------------+
participant start       end
-------------------------------------------+
101         1-1-2010    26-4-2010
101         27-4-2010   2-10-2014
101         3-10-2014   4-1-2015
101         5-1-2015    31-8-2015
101         1-9-2015    12-10-2016
101         13-10-2016  31-12-2018

Below is an example of table B that needs to connect to table A. As you can see, the dates differ too much for a simple left join:

Table B
---------------------------------------------------------+
participant start_date  end_date    Content
---------------------------------------------------------+
101         1-1-2012    31-8-2012   A
101         1-9-2012    31-8-2013   B
101         1-9-2013    31-8-2014   C
101         1-9-2014    2-10-2014   D
101         3-10-2014   31-8-2015   E
101         1-9-2015    31-1-2016   F
101         1-9-2015    31-1-2016   F

The idea for the joined table C is that every row of table A is informed by data from table B. I want to select an entry from B that falls within the span of table A. If several entries from B fit, the most recent one should be used. If table B doesn’t have information from that period (as is the case in the first row), the closest information should be used. A different way to put it is that I want the most recent information of B added to each row of A.

Table C
----------------------------------------------------------------------+
participant startA      endA        startB      endB        Content
----------------------------------------------------------------------+
101         1-1-2010    26-4-2010   1-1-2012    31-8-2012   A
101         27-4-2010   2-10-2014   1-9-2013    31-8-2014   C
101         3-10-2014   4-1-2015    1-9-2014    2-10-2014   D
101         5-1-2015    31-8-2015   3-10-2014   31-8-2015   E
101         1-9-2015    12-10-2016  1-9-2015    31-1-2016   F
101         13-10-2016  31-12-2018  1-9-2015    31-1-2016   F

This is the first time that I’m working with SAS and SQL, so my own efforts work very poorly. Below, I’m joining these two tables in a procedure with several steps: I’m first creating a full join to get all possible (relevant) permutations of table A and B. Then I calculate the date difference between the data from table A and B. Finally, for each period of A, I’m selecting the row where there is minimal difference in dates between the data from the original tables.

/* Create outer join of both tables*/
PROC SQL;
    CREATE TABLE work.fulljoin AS
    SELECT a.*, b.* 
    FROM work.table_A AS a
    FULL JOIN work.table_B AS b ON a.participant = b.participant;
quit;

/* Group by ID and entry date of each period */
PROC SORT data=work.fulljoin;
    BY participant startA; 
RUN;

/* Calculate the date differences between tables A and B */
DATA work.fulljoin_wdelta;
    SET work.fulljoin;
    delta=abs(endA-endB);
RUN;

/* Remove unnecessary rows */
PROC SQL;
    CREATE TABLE output.joined AS
    SELECT * FROM work.fulljoin_wdelta
    GROUP BY participant, startA
    HAVING delta=min(delta);
QUIT;

However with large datasets (millions of rows in A and B) this becomes prohibitive. Also, this method doesn’t strictly speaking enforce that you will get the most recent B data for every A period, just the one that’s closest in end date.

Upvotes: 0

Views: 1441

Answers (2)

Richard
Richard

Reputation: 27508

Date range joins can be very complex when you have to deal with ties, maximal coverage versus toe-in-the-water overlaps, etc... You certainly don't want to store the union and intermediaries in the final solution, although they can be helpful while debugging logic.

Here is a correlated sub-query technique that looks up the 'best' content range matching A. It will have problems if the content data is not distinct end_date with in participant.

Every one row (target) has a lookup done. The range overlap logic is important

      where one.participant = two.participant
        and two.start_date < one.end
        and two.end_date > one.start

and allows a content date range to be partially outside the target range.

data one;
input participant start: ddmmyy. end: ddmmyy.;

format start end yymmdd10.;

datalines;
101         1-1-2010    26-4-2010
101         27-4-2010   2-10-2014
101         3-10-2014   4-1-2015
101         5-1-2015    31-8-2015
101         1-9-2015    12-10-2016
101         13-10-2016  31-12-2018
;

data two;
input participant start_date: ddmmyy.  end_date: ddmmyy.   Content: $;

format start_date end_date yymmdd10.;

datalines;
101         1-1-2012    31-8-2012   A
101         1-9-2012    31-8-2013   B
101         1-9-2013    31-8-2014   C
101         1-9-2014    2-10-2014   D
101         3-10-2014   31-8-2015   E
101         1-9-2015    31-1-2016   F
101         1-9-2015    31-1-2017   F
run;

proc sql;
  create table want as 
  select 
    one.*
  , ( select min(content)
      from two 
      where one.participant = two.participant
        and two.start_date < one.end
        and two.end_date > one.start
      group by participant
      having end_date = max(end_date)
    ) as content
  from
    one
  order by
    participant, start
  ;
quit;

Upvotes: 0

user667489
user667489

Reputation: 9569

I think you can simplify your existing logic into a single query:

proc sql noprint _method;
  create table table_c as
    select 
      a.participant, 
      a.start as start_a, 
      a.end as end_a, 
      b.start_date as start_b, 
      b.end_date as end_b,
      abs(a.end - b.end_date) as delta
    from table_a a inner join table_b b 
    on a.participant = b.participant
    group by a.participant, start_a
    having delta = min(delta)
    ;     
quit;

The log output confirms that this performs a hash join as long as you have enough memory:

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxsumg
              sqxsort
                  sqxjhsh
                      sqxsrc( WORK.TABLE_B(alias = B) )
                      sqxsrc( WORK.TABLE_A(alias = A) )

If the resulting table is different from what you're trying to produce, please clarify.

Upvotes: 0

Related Questions