pyll
pyll

Reputation: 1764

Cartesian Join SAS Proc SQL

I have two data sets that look like this:

data sales;
    format week date9.;
    input store $ week date9. sales;
cards;
A 01JAN2014 200
A 08JAN2014 500
A 22JAN2014 200
B 01JAN2014 100
B 08JAN2014 200
B 15JAN2014 200
;
run;

data dates;
    format week date9.;
    input week date9.;
cards;
01JAN2014
08JAN2014
15JAN2014
22JAN2014
29JAN2014
;
run;

These are small examples of very large tables in a database. I want to join them using a proc sql step, so that all dates are present for each location (with missing values if no sales). I've been creating a cartesian join of the stores and weeks to produce the desired result, but this method has proven to be extremely inefficient when I attempt to query large amounts of data. I know there is a better way to do this, but cannot get it. Below is my attempt...to clarify, the full_sales data set IS the desired output, I just need a more efficient way to do it. Thanks.

proc sql;
    create table cartesian as
    select distinct sales.store, dates.week
    from sales, dates
    order by 1,2;
quit;

proc sql;
    create table full_sales as
    select cartesian.store, 
                cartesian.week,
                sales.sales
    from cartesian
    left join sales
        on cartesian.store=sales.store and
        cartesian.week=sales.week
    order by 1,2
    ;
quit;

Upvotes: 0

Views: 1585

Answers (2)

Longfish
Longfish

Reputation: 7602

I'll recommend an alternative approach, which involves creating a format from the DATES dataset, then using completetypes and preloadfmt in proc summary to output all the observations.

This makes the assumption that all the required dates appear in the DATES dataset(i.e. there are no dates in SALES that don't appear in DATES). This may be a faster approach than using proc sql.

I've added a step at the end to change the format back to the standard date9. as you could get problems with people opening the dataset with the user defined wk_fmt. format.

data sales;
    format week date9.;
    input store $ week :date9. sales;
cards;
A 01JAN2014 200
A 08JAN2014 500
A 22JAN2014 200
B 01JAN2014 100
B 08JAN2014 200
B 15JAN2014 200
;
run;

data dates;
    format week date9.;
    input week :date9.;
cards;
01JAN2014
08JAN2014
15JAN2014
22JAN2014
29JAN2014
;
run;

/* create dataset with format details */
data week_format;
set dates;
rename week = start;
retain fmtname 'wk_fmt' type 'N';
label = vvalue(week);
run;

/* load format from previous dataset */
proc format cntlin=week_format;
run;

/* create table of all combinations of store and dates */
proc summary data = sales nway completetypes;
class store;
class week / preloadfmt;
format week wk_fmt.;
id sales;
output out=want (drop=_:);
run;

/* change format back to date9. */
proc datasets lib=work nodetails nolist;
modify want;
format week date9.;
quit;

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1269803

I would approach this as:

proc sql;
    create table full_sales as
    select s.store, w.week, ssa.sales
    from (select distinct store from sales) s cross join
         dates w left join
         sales sa
         on s.store = sa.store and
            w.week = sa.week
    order by 1,2
    ;
quit;

This does not require the auxiliary table cartesian, which might be faster. An index on sales(store, week) would definitely speed the query.

Upvotes: 2

Related Questions