JustinJDavies
JustinJDavies

Reputation: 2693

Create table that splits records on specific dates using SAS

I wish to take table A and create something like table B, but based on an arbitrary set of split dates contained in table C.

For example, (note it is not always true that start_date = inception_date, and so inception_date must be preserved rather than derived from start_date; this actually represents hundreds of fields that belong with the period)

enter image description here

I'm working in SAS but I'd like to be able to write this using PROC SQL. I think one way to do this would be to create multiple tables for pairs of records from table C (including nulls at the end), and then union them together.

Pseudo-code example:

for each record of table_c, concoct the pairs { (., 01-Jan-2012), (01-Jan-2012, 01-Jul-2012), (01-Jul-2012, 01-Jan-2013), (01-Jan-2013, .) }

The following query may require some null testing around split_date1 and split_date2:

CREATE TABLE subquery1 AS
SELECT 
    a.customer_id
    ,max(a.start_date, x.split_date1) AS start_date
    ,min(a.end_date, x.split_date2 - 1) AS end_date
    ,a.inception_date
FROM table_a AS a
JOIN split_date AS x
;
.... (do for each pair of split dates, and then union all these tables together with some WHERE querying to throw away the nonsensical rows) to produce table_b. The image above indicates which subquery would generate which rows in table_b

Please help me fill in the gaps, or suggest an alternative method.

table_a:

customer_id start_date  end_date    inception_date      
aaa 18-Jun-11   17-Jun-12   18-Jun-11       
aaa 18-Jun-12   17-Jun-13   18-Jun-12       
bbb 13-Jul-11   12-Jul-12   13-Jul-11       
ccc 14-May-11   13-Nov-11   14-Jul-11       
ddd 21-Jun-11   20-Jun-12   21-Jun-11

table_b:

customer_id start_date  end_date    inception_date      subquery
aaa 18-Jun-11   31-Dec-11   18-Jun-11       (1)
aaa 01-Jan-12   17-Jun-12   18-Jun-11       (2)
aaa 18-Jun-12   30-Jun-12   18-Jun-12       (2)
aaa 01-Jul-12   31-Dec-12   18-Jun-12       (3)
aaa 01-Jan-13   17-Jun-13   18-Jun-12       (4)
bbb 13-Jul-11   31-Dec-11   13-Jul-11       (1)
bbb 01-Jan-12   30-Jun-12   13-Jul-11       (2)
bbb 01-Jul-12   12-Jul-12   13-Jul-11       (3)
ccc 14-May-11   13-Nov-11   14-May-11       (1)
ddd 21-Jun-11   31-Dec-11   21-Jun-11       (1)
ddd 01-Jan-12   20-Jun-12   21-Jun-11       (2)

table_c:

split_dates                 
01-Jan-12                   
01-Jul-12                   
01-Jan-13   

Upvotes: 3

Views: 2190

Answers (4)

JustinJDavies
JustinJDavies

Reputation: 2693

The relational way of solving this:

  • Using the assumption that the earliest start_date is 01-Jan-00 and the latest us 31-Dec-20

Reformulate table_c as such:

split_start split_end
01-Jan-00 31-Dec-12                   
01-Jan-12 30-Jun-12                   
01-Jul-12 31-Dec-12 
01-Jan-13 31-Dec-20

Then use a query of the form:

SELECT 
  a.customer_id
  ,max(a.start_date, c.split_start) AS start_date
  ,min(a.end_date, c.split_end) AS end_date
  ,a.inception_date
FROM table_a AS a JOIN table_c AS c
WHERE a.start_date < c.split_end
AND a.end_Date > c.split_start
;

The reformulation of table_c using SQL within SAS is made tricky by the inability to self-join using WHERE NOT EXISTS and unreliability of MONOTONIC(). However in MSSQL and other SQL environments this is relatively straightforward to do.

For example:

// INSERT INTO table_c to include two extra dates for max and min dates as above
// then...

WITH table_c_old
AS
(
SELECT ROW_NUMBER() OVER (ORDER BY split_date) seq FROM table_c
)
SELECT 
  c1.seq seq1
  ,c1.split_date AS split_start
  ,c2.seq seq2
  ,c2.split_date - 1 AS split_end
FROM table_c_old c1 LEFT JOIN table_c_old c2
ON c1.seq = c2.seq+1;

Upvotes: 0

rambles
rambles

Reputation: 706

Here's a hybrid SQL/datastep approach - but it is shorter! Input the data (taken from the answer given by @Joe):-

data table_a;
  informat start_date end_date date9.;
  format start_date end_date date9.;
  input customer_id $ start_date end_date;
  datalines;
  aaa 18JUN2011 17JUN2012
  aaa 18JUN2012 17JUN2013
  bbb 13JUL2011 12JUL2012
  ccc 14MAY2011 13NOV2011
  ddd 21JUN2011 20JUN2012
  ;;;;
run;

data table_c;
  informat split_dates date9.;
  format split_dates date9.;
  input split_dates;
  datalines;
  01JAN2012
  01JUL2012
  01JAN2013
  ;;;;
run;

The following copies the split dates to a macro variable (SQL!) and then loops through table_a using this macro (datastep!):-

**  Output the split dates to a macro variable;
proc sql noprint;
  select split_dates format=8. into: c_dates separated by ',' from table_c order by split_dates;
quit;

**  For each period in table_a, look to see if each split date is within it,;
**  outputting a row if so;
data final_out(drop=dt old_end_date);
  set table_a(rename=(end_date = old_end_date));

  format start_date end_date inception_date date11.;
  inception_date = start_date;

  do dt = &c_dates;

    if start_date <= dt <= old_end_date then do;
      end_date = dt - 1;
      output;
      start_date = dt;
    end;

  end;

  **  For the last row per table_a entry;
  end_date = old_end_date;
  output;
run;

And if you know the split dates beforehand, you could hard code them into the datastep and omit the SQL bit (not recommended mind - hard coding is seldom a good idea).

Upvotes: 1

JustinJDavies
JustinJDavies

Reputation: 2693

An alternative SQL-based (iterative) procedure would be of the form:

subquery_a

SELECT 
a.*
,c.*
FROM table_a AS a
LEFT JOIN table_c AS c
WHERE a.start_date <= c.split_date
AND a.end_date > c.split_date
;

// gives us a table of all records where at least one split is required

subquery_b

SELECT  
    min(split_date) AS split_date
,a.customer_id
,a.start_date
,a.end_date
,a.inecption_date
FROM subquery_a AS a
GROUP BY
a.customer_id
,a.start_date
,a.end_date
,a.inecption_date   
;

subquery_c

(SELECT
sqb.customer_id
,sqb.start_date
    ,(sqb.split_date - 1) AS end_date
,sqb.inecption_date
FROM subquery_b AS sqb)
UNION ALL
    (SELECT
sqb.customer_id
,sqb.split_date AS start_date
    ,end_date
,sqb.inecption_date
FROM subquery_b AS sqb)

Then drop rows subquery_c LEFT JOIN table_a ON customer_id AND start_date from table_a, followed by UNION of table_a with subquery_c. Iterate until the updated subquery_c has no rows then stop.

However I don't know how to express the final iterative step in SQL alone. Perhaps someone could edit this answer to assist me?

Upvotes: 0

Joe
Joe

Reputation: 63434

Data step solution.

First, sample data (I left out the other date variable, I think it's unimportant to the solution although of course you'll want it in production):

data table_a;
informat start_date end_date date9.;
format start_date end_date date9.;
input customer_id $ start_date end_date;
datalines;
aaa 18JUN2011 17JUN2012
aaa 18JUN2012 17JUN2013
bbb 13JUL2011 12JUL2012
ccc 14MAY2011 13NOV2011
ddd 21JUN2011 20JUN2012
;;;;
run;

data table_c;
informat split_dates date9.;
format split_dates date9.;
input split_dates;
datalines;
01JAN2011
01JUL2011
01JAN2012
01JUL2012
01JAN2013
;;;;
run;

Now, the solution. First, we load the data from table_c into a temporary array; a hash table would also work (and might be faster if table c is very long, since this solution requires iterating over all of the array while a hash table would have a faster time just finding the few that match).

Then we iterate over the array C was loaded into, check if it qualifies as a useful break point, if so assign the start/end dates, output, and re-assign the new start date. Here I use new start/end variables; if you want to keep the old variable names, just rename the original variables on input to some other variable name and then use the original variable names as the new ones and the renamed original variables as the old ones.

data table_b;
 set table_a;
 format final_start final_end date9.;
 array split_date_list[100] _temporary_; *make sure this 100 is as big or bigger than table_c;
 if _n_=1 then do;
  do _t = 1 to nobsc;  *load the contents of table_c into a temporary array;
    set table_c point=_t nobs=nobsc;
    split_date_list[_t]=split_dates;
  end;
 end;
 final_start=start_date; *You could reuse start_date here, I use new name for consistency;
 do _u= 1 to dim(split_date_list) until (final_end=end_date);
  if final_start le split_date_list[_u] le end_date then do;  *if split date is in between start and end, split it;
   final_end=split_date_list[_u]-1;   *But end_date does need a second variable, else it loses track of the actual end;
   output; *output a row;
   final_start=split_date_list[_u]; *fix the start date to the new value;
  end;
  else if split_date_list[_u] gt end_date then do; *if we have passed the end date;
   final_end=end_date;
   output;
  end;
 end;
 if end_date ne final_end then do; *if we never passed the end date, output the final row;
   final_end=end_date;
   output;
 end;
run;

Upvotes: 1

Related Questions