Reputation: 3
There are two tables (A and B) that I want to join by date. This is complicated by the fact that the dates do not necessarily line up neatly across tables. That is to say, dates from table B might fall outside or between the entries in table A.
I suspect that there must be an easy way to accomplish this in SQL/SAS, but I’m very new to both and don’t see how. I would be very glad if someone could point me to concrete solutions, examples, or functions that I can use to solve this problem. I created a fictitious case below to illustrate.
This is an example of how table A might look like (for one participant):
Table A
-------------------------------------------+
participant start end
-------------------------------------------+
101 1-1-2010 26-4-2010
101 27-4-2010 2-10-2014
101 3-10-2014 4-1-2015
101 5-1-2015 31-8-2015
101 1-9-2015 12-10-2016
101 13-10-2016 31-12-2018
Below is an example of table B that needs to connect to table A. As you can see, the dates differ too much for a simple left join:
Table B
---------------------------------------------------------+
participant start_date end_date Content
---------------------------------------------------------+
101 1-1-2012 31-8-2012 A
101 1-9-2012 31-8-2013 B
101 1-9-2013 31-8-2014 C
101 1-9-2014 2-10-2014 D
101 3-10-2014 31-8-2015 E
101 1-9-2015 31-1-2016 F
101 1-9-2015 31-1-2016 F
The idea for the joined table C is that every row of table A is informed by data from table B. I want to select an entry from B that falls within the span of table A. If several entries from B fit, the most recent one should be used. If table B doesn’t have information from that period (as is the case in the first row), the closest information should be used. A different way to put it is that I want the most recent information of B added to each row of A.
Table C
----------------------------------------------------------------------+
participant startA endA startB endB Content
----------------------------------------------------------------------+
101 1-1-2010 26-4-2010 1-1-2012 31-8-2012 A
101 27-4-2010 2-10-2014 1-9-2013 31-8-2014 C
101 3-10-2014 4-1-2015 1-9-2014 2-10-2014 D
101 5-1-2015 31-8-2015 3-10-2014 31-8-2015 E
101 1-9-2015 12-10-2016 1-9-2015 31-1-2016 F
101 13-10-2016 31-12-2018 1-9-2015 31-1-2016 F
This is the first time that I’m working with SAS and SQL, so my own efforts work very poorly. Below, I’m joining these two tables in a procedure with several steps: I’m first creating a full join to get all possible (relevant) permutations of table A and B. Then I calculate the date difference between the data from table A and B. Finally, for each period of A, I’m selecting the row where there is minimal difference in dates between the data from the original tables.
/* Create outer join of both tables*/
PROC SQL;
CREATE TABLE work.fulljoin AS
SELECT a.*, b.*
FROM work.table_A AS a
FULL JOIN work.table_B AS b ON a.participant = b.participant;
quit;
/* Group by ID and entry date of each period */
PROC SORT data=work.fulljoin;
BY participant startA;
RUN;
/* Calculate the date differences between tables A and B */
DATA work.fulljoin_wdelta;
SET work.fulljoin;
delta=abs(endA-endB);
RUN;
/* Remove unnecessary rows */
PROC SQL;
CREATE TABLE output.joined AS
SELECT * FROM work.fulljoin_wdelta
GROUP BY participant, startA
HAVING delta=min(delta);
QUIT;
However with large datasets (millions of rows in A and B) this becomes prohibitive. Also, this method doesn’t strictly speaking enforce that you will get the most recent B data for every A period, just the one that’s closest in end date.
Upvotes: 0
Views: 1441
Reputation: 27508
Date range joins can be very complex when you have to deal with ties, maximal coverage versus toe-in-the-water overlaps, etc... You certainly don't want to store the union and intermediaries in the final solution, although they can be helpful while debugging logic.
Here is a correlated sub-query technique that looks up the 'best' content range matching A. It will have problems if the content data is not distinct end_date
with in participant
.
Every one
row (target) has a lookup done. The range overlap logic is important
where one.participant = two.participant
and two.start_date < one.end
and two.end_date > one.start
and allows a content date range to be partially outside the target range.
data one;
input participant start: ddmmyy. end: ddmmyy.;
format start end yymmdd10.;
datalines;
101 1-1-2010 26-4-2010
101 27-4-2010 2-10-2014
101 3-10-2014 4-1-2015
101 5-1-2015 31-8-2015
101 1-9-2015 12-10-2016
101 13-10-2016 31-12-2018
;
data two;
input participant start_date: ddmmyy. end_date: ddmmyy. Content: $;
format start_date end_date yymmdd10.;
datalines;
101 1-1-2012 31-8-2012 A
101 1-9-2012 31-8-2013 B
101 1-9-2013 31-8-2014 C
101 1-9-2014 2-10-2014 D
101 3-10-2014 31-8-2015 E
101 1-9-2015 31-1-2016 F
101 1-9-2015 31-1-2017 F
run;
proc sql;
create table want as
select
one.*
, ( select min(content)
from two
where one.participant = two.participant
and two.start_date < one.end
and two.end_date > one.start
group by participant
having end_date = max(end_date)
) as content
from
one
order by
participant, start
;
quit;
Upvotes: 0
Reputation: 9569
I think you can simplify your existing logic into a single query:
proc sql noprint _method;
create table table_c as
select
a.participant,
a.start as start_a,
a.end as end_a,
b.start_date as start_b,
b.end_date as end_b,
abs(a.end - b.end_date) as delta
from table_a a inner join table_b b
on a.participant = b.participant
group by a.participant, start_a
having delta = min(delta)
;
quit;
The log output confirms that this performs a hash join as long as you have enough memory:
NOTE: SQL execution methods chosen are:
sqxcrta
sqxsumg
sqxsort
sqxjhsh
sqxsrc( WORK.TABLE_B(alias = B) )
sqxsrc( WORK.TABLE_A(alias = A) )
If the resulting table is different from what you're trying to produce, please clarify.
Upvotes: 0