Reputation: 33
Simplifying the description of my table to keep my question concise...
I have a data set with 3 columns. The first column contains 100 cost categories (i.e. unique key) and second column contains cost for the given cost category, the third contains units sold.
My goal is to turn this into a table with a column for each CostCat containing a sum of the Cost field for that given category, grouped by UnitsSold.
i.e.
╔════════════╦══════════╦══════════╦═══════
║ UnitsSold ║ CatCost1 ║ CatCost2 ║ CostCat...
╠════════════╬══════════╬══════════╬═══════
║ 1 ║ 50 ║ 10 ║ ...
║ 2 ║ 20 ║ 15 ║ ...
║ ... ║ ... ║ ... ║ ...
╚════════════╩══════════╩══════════╩═══════
My inclination is to use code like this:
PROC SQL;
CREATE TABLE cartesian AS
SELECT
UnitsSold,
SUM(CASE WHEN CostCat=1 THEN Cost else 0 end) as CostCat1,
sum(case when CostCat=1 then Cost else 0 end) as CostCat2,
sum(case when CostCat=1 then Cost else 0 end) as CostCat3,
...
sum(case when CostCat=100 then Cost else 0 end) as CostCat100
GROUP BY UnitsSold;
QUIT;
I'm wondering if there is a more effective way of doing this than writing out a ridiculous amount of CASE statements? (obviously using Excel to generate the actual typing).
I imagine there is some type of macro looping possible, but not familiar enough with macros to know how to do this yet.
I traditionally use PROC SQL, so that is my first preference, but open to SAS code solutions as well
Upvotes: 1
Views: 2635
Reputation: 27498
Michael:
The question is describing a PIVOT operation, also known as TRANSPOSE in SAS lingo, Paste/Special Transpose or PIVOT table in Excel.
If you stick with Proc SQL statements there is no PIVOT operator. SQL Server and other data bases do have PIVOT operators. But suppose you do stick with SAS Proc SQL. You are correct that you will need those many CASE statements in order to create the across variables.
There are numerous ways to pivot data in SAS. Here are six ways:
data have;
do row = 1 to 500;
cost_cat = ceil(100 * ranuni(123));
cost = 10 + floor(50 * ranuni(123));
units_sold = floor (20 * ranuni(123));
output;
end;
run;
Class variables are used in the table statement to layout the rows and columns.
proc tabulate data=have;
class cost_cat units_sold;
var cost;
table units_sold, cost_cat*cost*sum / nocellmerge;
run;
cost category and cost columns are stacked. Cost
does not have a define
statement and will default to display sum
. The sum is performed for cost over values in each group * across:
proc report data=have;
columns units_sold (cost_cat, cost) ;
define units_sold / group;
define cost_cat / across;
run;
Transpose will create a data set with columns 'out of order' because the columns are created in the order in which the id values appear as you step through the units_solds.
This can be prevented by adding extra data to have
. The data would have units_sold = -1 and there would be a row for each cost_cat value. The extra group is removed as part of the TRANSPOSE out= data set options -- for example: (... where=(units_sold ne -1))
proc means noprint data=have;
class units_sold cost_cat;
var cost;
ways 2;
output sum=sum out=haveMeans ;
run;
proc transpose data=haveMeans out=wantAcross1(drop=_name_) prefix=cost_sum_;
by units_sold;
var sum;
id cost_cat;
;
run;
The macro is simpler because it is specific to the data set in question. For a more general case the salient aspects of the statement generation can be abstracted and further macro-ized (see Way 5)
%macro pivot_across;
%local i;
proc sql;
create table wantAcross2 as
select units_sold
%do i = 1 %to 100; %* codegen many sql select expressions;
, sum ( case when cost_cat = &i then cost else 0 end ) as cost_sum_&i
%end;
from have
group by units_sold;
quit;
%mend;
%pivot_across;
Tip: With a few changes the code gen can be Proc SQL pass-through and perform the pivot remotely.
Well not quite any data set. This macro in its current form handles id variables that are numeric and whose values are expressible exactly as a perceived numeric literal emitted by cats()
. A more robust version would examine the type of the id variable and quote the id values compared to in the generated CASE statements. The most robust version would have a code gen'd CASE statement that checked id values per put(..., RB8.)
%macro sql_transpose (data=, out=, by=, var=, id=, aggregate_function=sum, default=0, prefix=, suffix=);
/*
* CASE statement codegener will need tweaking to handle character id variables (i.e. QUOTE of the &id)
* CASE statement codegener will need tweaking to handle numeric id variables that have non-integer values
* inexpressible as a simple source code numeric literal. (i.e. may need to compare data when formnatted as RB4.);
*/
%local case_statements;
proc sql noprint;
select
"&aggregate_function ("
|| "CASE when &id = " || cats(idValues.&id) || " then &var else &default end"
|| ") as &prefix" || cats(idValues.&id) || "&suffix"
into :case_statements
separated by ','
from (select distinct &id from &data) as idValues
order by &id
;
%*put NOTE: %superq(case_statements);
create table &out as
select &by, &case_statements
from &data
group by &by;
quit;
%mend;
%sql_transpose
( data=have
, out=wantAcross3
, by=units_sold
, id=cost_cat
, var=cost
, prefix=cost_sum_
);
Tip: With a few changes the code gen can be Proc SQL pass-through and perform the pivot remotely. Special attention would need to be made for gathering the data behind the case_statements
.
If you are a hashaholic this code might not seem extravagant.
data _null_;
if 0 then set have(keep=units_sold cost_cat cost); * prep pdv;
* hash for tracking id values;
declare hash ids(ordered:'a');
ids.defineKey('cost_cat');
ids.defineDone();
* hash for tracking sums
* NOTE: The data node has a sum variable instead of using
* argument tags suminc: and keysum: This was done because HITER NEXT() does not
* automatically load the keysum value into its PDV variable (meaning
* another lookup via .SUM() would have to occur in order to obtain it);
call missing (cost_sum);
declare hash sums(ordered:'a');
sums.defineKey('units_sold', 'cost_cat');
sums.defineData('units_sold', 'cost_cat', 'cost_sum');
sums.defineDone();
* scan the data - track the id values and sums for pivoted output;
do while (not done);
set have(keep=units_sold cost_cat cost) end=done;
ids.ref();
if (0 ne sums.find()) then cost_sum = 0;
cost_sum + cost;
sums.replace();
end;
* create a dynamic output target;
* a pool of pdv host variables is required for target;
array cells cost_sum_1 - cost_sum_10000;
call missing (of cost_sum_1 - cost_sum_10000);
* need to iterate across the id values in order to create a
* variable name that will be part of the wanted data node;
declare hiter across('ids');
declare hash want(ordered:'a');
want.defineKey('units_sold');
want.defineData('units_sold');
do while (across.next() = 0);
want.defineData(cats('cost_sum_',cost_cat)); * sneaky! ;
end;
want.defineDone();
* populate target;
* iterate through the sums filling in the PDV variables
* associated with the dynamically defined data node;
declare hiter item('sums');
prior_key1 = .; prior_key2 = .;
do while (item.next() = 0);
if units_sold ne prior_key1 then do;
* when the 'group' changes add an item to the want hash, which will reflect the state of the PDV;
if prior_key1 ne . then do;
key1_hold = units_sold;
units_sold = prior_key1;
want.add(); * save 'row' to hash;
units_sold = key1_hold;
call missing (of cells(*));
end;
end;
cells[cost_cat] = cost_sum;
prior_key1 = units_sold;
end;
want.add();
* output target;
want.output (dataset:'wantAcross4');
stop;
run;
Proc COMPARE
will show that all the want
outputs are the same.
proc compare nomissing
noprint data=wantAcross1 compare=wantAcross2 out=diff1_2 outnoequal;
id units_sold;
run;
proc compare
noprint data=wantAcross2 compare=wantAcross3 out=diff2_3 outnoequal;
id units_sold;
run;
proc compare nomissing
noprint data=wantAcross3 compare=wantAcross4 out=diff3_4 outnoequal;
id units_sold;
run;
Upvotes: 2
Reputation: 1103
As Reeza pointed out, the best way to do this is probably through a combination of proc sql
or proc means/summary
and proc transpose
. I'm assuming you know SQL so I'll go into that description first.
proc sql;
create table tmp as
select UnitsSold, CostCat, sum(cost) as cost
from have
group by UnitsSold, CostCat;
quit;
If you wanted to do this via a SAS procedure you can use proc summary
.
proc summary data=have nway missing;
class UnitsSold CostCat;
var Cost;
output out=tmp(drop=_:) sum=; ** drop=_: removes the automatic variables created in the procedure;
run;
Now that the table is summarized and sorted by UnitsSold
and CostCat
you can transpose the table.
proc transpose data=tmp out=want(drop=_NAME_) prefix=CostCat;
by UnitsSold;
id CostCat;
var cost;
run;
Upvotes: 0