Reputation: 63
I just started learning SAS and realised that proc sql don't use window functions. As I am more at ease with sql I was wondering how I can simulate a sum window function in proc?
desired result
select a.active, a.store_id, a.nbr, sum(nbr) over (partition by a.store_id)
from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
;
active | store_id | nbr | sum |
---|---|---|---|
0 | 1 | 8 | 326 |
1 | 1 | 318 | 326 |
0 | 2 | 7 | 273 |
1 | 2 | 266 | 273 |
eg of raw data
select active, store_id, customer_id
from customer
limit 10;
active | store_id | customer_id |
---|---|---|
1 | 1 | 1 |
1 | 1 | 2 |
1 | 2 | 3 |
1 | 2 | 4 |
1 | 1 | 5 |
1 | 1 | 6 |
0 | 1 | 7 |
1 | 2 | 8 |
1 | 1 | 9 |
1 | 2 | 10 |
current result and query
select a.active, a.store_id, a.nbr, sum(nbr)
from(select active, store_id, count(customer_id) as nbr from customer group by active, store_id) as a
group by a.active, a.store_id, a.nbr;
active | store_id | nbr | sum |
---|---|---|---|
0 | 1 | 8 | 8 |
1 | 1 | 318 | 318 |
0 | 2 | 7 | 7 |
1 | 2 | 266 | 266 |
Upvotes: 0
Views: 1538
Reputation: 51611
Unlike some SQL implementation SAS is happy to re-merge summary statistics back onto the detail rows when you include variables that are neither group by nor summary statistics.
Let's convert your print outs of data into an actual dataset. And let's change one value so we have at least two values of ACTIVE to group by.
data have;
input active store_id customer_id;
cards;
1 1 1
1 1 2
1 2 3
1 2 4
1 1 5
1 1 6
0 1 7
1 2 8
1 1 9
1 2 10
;
Now we can count the records by ACTIVE and STORE_ID and then generate the report by appending the store total.
proc sql;
select active,store_id,nbr,sum(nbr) as store_nbr
from (select active,store_id,count(*) as nbr
from have
group by active,store_id
)
group by store_id
;
Resulting printout:
active store_id nbr store_nbr
---------------------------------------
0 1 1 6
1 1 5 6
1 2 4 4
Upvotes: 4
Reputation: 12909
You can do the equivalent in proc sql
by merging two sub-queries: one for the count of customers by active, store_id
, and another for the total customers for each store_id
.
proc sql noprint;
create table want as
select t1.active
, t1.store_id
, t1.nbr
, t2.sum
from (select active
, store_id
, count(customer_id) as nbr
from have
group by store_id, active
) as t1
LEFT JOIN
(select store_id
, count(customer_id) as sum
from have
group by store_id
) as t2
ON t1.store_id = t2.store_id
;
quit;
If you wanted to do this in a more SASsy way, you can run proc means
and merge together the results from a single dataset that holds everything you need. proc means
will calculate all possible combinations of your variables by default.
proc means data=have noprint;
class store_id active;
ways 1 2;
output out=want_total
n(customer_id) = total
;
run;
data want;
merge want_total(where=(_TYPE_ = 3) rename=(total = nbr) )
want_total(where=(_TYPE_ = 2) rename=(total = sum) keep=_TYPE_ store_id total)
;
by store_id;
drop _:;
run;
Or, in SQL:
proc sql;
create table want as
select t1.store_id
, t1.active
, t1.total as nbr
, t2.total as sum
from want_total as t1
LEFT JOIN
want_total as t2
ON t1.store_id = t2.store_id
where t1._TYPE_ = 3
AND t2._TYPE_ = 2
;
quit;
The _TYPE_
variable identifies the level of the analysis. For example, _TYPE_ = 1
is for active
only, _TYPE_ = 2
is for store_id
only, and _TYPE_ = 3
is for all combinations. You can view this in the output dataset from proc means
:
store_id active _TYPE_ _FREQ_ total
. 0 1 3 3
. 1 1 7 7
1 . 2 6 6
2 . 2 4 4
1 0 3 1 1
1 1 3 5 5
2 0 3 2 2
2 1 3 2 2
And if you wanted faster high-performance results, check out its big sibling, proc hpsummary
.
Therein lies the cool thing about SAS: You can bounce between PROCs, SQL, the DATA Step, and Python via Pandas/proc python. You can exploit the unique benefits of each of these methods and thought processes for any number of data engineering and statistics problems.
Upvotes: 1