christinelly
christinelly

Reputation: 63

How to write proc sql without windowfunction over partition by sum?

I just started learning SAS and realised that proc sql don't use window functions. As I am more at ease with sql I was wondering how I can simulate a sum window function in proc?

desired result

 select a.active, a.store_id, a.nbr, sum(nbr) over (partition by a.store_id)
from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
;
active store_id nbr sum
0 1 8 326
1 1 318 326
0 2 7 273
1 2 266 273

eg of raw data

select active, store_id, customer_id
from customer
limit 10;
active store_id customer_id
1 1 1
1 1 2
1 2 3
1 2 4
1 1 5
1 1 6
0 1 7
1 2 8
1 1 9
1 2 10

current result and query

    select a.active, a.store_id, a.nbr, sum(nbr) 
    from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
    group by a.active, a.store_id, a.nbr;
active store_id nbr sum
0 1 8 8
1 1 318 318
0 2 7 7
1 2 266 266

Upvotes: 0

Views: 1538

Answers (2)

Tom
Tom

Reputation: 51611

Unlike some SQL implementation SAS is happy to re-merge summary statistics back onto the detail rows when you include variables that are neither group by nor summary statistics.

Let's convert your print outs of data into an actual dataset. And let's change one value so we have at least two values of ACTIVE to group by.

data have;
  input active store_id customer_id;
cards;
1 1 1
1 1 2
1 2 3
1 2 4
1 1 5
1 1 6
0 1 7
1 2 8
1 1 9
1 2 10
;

Now we can count the records by ACTIVE and STORE_ID and then generate the report by appending the store total.

proc sql;
select active,store_id,nbr,sum(nbr) as store_nbr
  from (select active,store_id,count(*) as nbr 
        from have 
        group by active,store_id
       )
  group by store_id
;

Resulting printout:

  active  store_id       nbr  store_nbr
---------------------------------------
       0         1         1          6
       1         1         5          6
       1         2         4          4

Upvotes: 4

Stu Sztukowski
Stu Sztukowski

Reputation: 12909

You can do the equivalent in proc sql by merging two sub-queries: one for the count of customers by active, store_id, and another for the total customers for each store_id.

proc sql noprint;
    create table want as
        select t1.active
             , t1.store_id
             , t1.nbr
             , t2.sum
        from (select active
                   , store_id
                   , count(customer_id) as nbr
              from have 
              group by store_id, active
             ) as t1

        LEFT JOIN

             (select store_id
                   , count(customer_id) as sum
              from have
              group by store_id
             ) as t2

        ON t1.store_id = t2.store_id
    ;
quit;

If you wanted to do this in a more SASsy way, you can run proc means and merge together the results from a single dataset that holds everything you need. proc means will calculate all possible combinations of your variables by default.

proc means data=have noprint;
    class store_id active;
    ways 1 2;

    output out=want_total
        n(customer_id) = total
    ;
run;

data want;
    merge want_total(where=(_TYPE_ = 3) rename=(total = nbr) )
          want_total(where=(_TYPE_ = 2) rename=(total = sum) keep=_TYPE_ store_id total)
    ;
    by store_id;

    drop _:;
run;

Or, in SQL:

proc sql;
    create table want as
        select t1.store_id
             , t1.active
             , t1.total as nbr
             , t2.total as sum
        from want_total as t1
        LEFT JOIN
             want_total as t2
        ON t1.store_id = t2.store_id
        where     t1._TYPE_ = 3
              AND t2._TYPE_ = 2
    ;
quit;

The _TYPE_ variable identifies the level of the analysis. For example, _TYPE_ = 1 is for active only, _TYPE_ = 2 is for store_id only, and _TYPE_ = 3 is for all combinations. You can view this in the output dataset from proc means:

store_id    active  _TYPE_  _FREQ_  total
.           0       1        3      3
.           1       1        7      7
1           .       2        6      6
2           .       2        4      4
1           0       3        1      1
1           1       3        5      5
2           0       3        2      2
2           1       3        2      2

And if you wanted faster high-performance results, check out its big sibling, proc hpsummary.

Therein lies the cool thing about SAS: You can bounce between PROCs, SQL, the DATA Step, and Python via Pandas/proc python. You can exploit the unique benefits of each of these methods and thought processes for any number of data engineering and statistics problems.

Upvotes: 1

Related Questions