WinstonDoodle
WinstonDoodle

Reputation: 33

SAS - PROC SQL - Sum values into unique columns

Simplifying the description of my table to keep my question concise...

I have a data set with 3 columns. The first column contains 100 cost categories (i.e. unique key) and second column contains cost for the given cost category, the third contains units sold.

My goal is to turn this into a table with a column for each CostCat containing a sum of the Cost field for that given category, grouped by UnitsSold.

i.e.

╔════════════╦══════════╦══════════╦═══════
║  UnitsSold ║ CatCost1 ║ CatCost2 ║ CostCat...
╠════════════╬══════════╬══════════╬═══════
║    1       ║    50    ║    10    ║ ...
║    2       ║    20    ║    15    ║ ...
║    ...     ║    ...   ║    ...   ║ ...
╚════════════╩══════════╩══════════╩═══════

My inclination is to use code like this:

PROC SQL;
CREATE TABLE cartesian AS
SELECT
  UnitsSold,
  SUM(CASE WHEN CostCat=1 THEN Cost else 0 end) as CostCat1,
  sum(case when CostCat=1 then Cost else 0 end) as CostCat2,
  sum(case when CostCat=1 then Cost else 0 end) as CostCat3,
  ...
  sum(case when CostCat=100 then Cost else 0 end) as CostCat100
GROUP BY UnitsSold;
QUIT;

I'm wondering if there is a more effective way of doing this than writing out a ridiculous amount of CASE statements? (obviously using Excel to generate the actual typing).

I imagine there is some type of macro looping possible, but not familiar enough with macros to know how to do this yet.

I traditionally use PROC SQL, so that is my first preference, but open to SAS code solutions as well

Upvotes: 1

Views: 2635

Answers (2)

Richard
Richard

Reputation: 27498

Michael:

The question is describing a PIVOT operation, also known as TRANSPOSE in SAS lingo, Paste/Special Transpose or PIVOT table in Excel.

If you stick with Proc SQL statements there is no PIVOT operator. SQL Server and other data bases do have PIVOT operators. But suppose you do stick with SAS Proc SQL. You are correct that you will need those many CASE statements in order to create the across variables.

There are numerous ways to pivot data in SAS. Here are six ways:

Sample data

data have;
  do row = 1 to 500;
    cost_cat = ceil(100 * ranuni(123));
    cost = 10 + floor(50 * ranuni(123));
    units_sold = floor (20 * ranuni(123));
    output;
  end;
run;

Way 1 - Proc TRANSPOSE : Pivot for presentation only

Class variables are used in the table statement to layout the rows and columns.

proc tabulate data=have;
  class cost_cat units_sold;
  var cost;
  table units_sold, cost_cat*cost*sum / nocellmerge;
run;

Way 2 - Proc REPORT : Pivot for presentation only

cost category and cost columns are stacked. Cost does not have a define statement and will default to display sum. The sum is performed for cost over values in each group * across:

proc report data=have;
  columns units_sold (cost_cat, cost) ;
  define units_sold / group;
  define cost_cat / across;
run;

Way 3 - Proc MEANS + Proc TRANSPOSE : Pivot for data

Transpose will create a data set with columns 'out of order' because the columns are created in the order in which the id values appear as you step through the units_solds.
This can be prevented by adding extra data to have. The data would have units_sold = -1 and there would be a row for each cost_cat value. The extra group is removed as part of the TRANSPOSE out= data set options -- for example: (... where=(units_sold ne -1))

proc means noprint data=have;
  class units_sold cost_cat;
  var cost;
  ways 2;
  output sum=sum out=haveMeans ;
run;

proc transpose data=haveMeans out=wantAcross1(drop=_name_) prefix=cost_sum_;
  by units_sold;
  var sum;
  id cost_cat;
  ;  
run;

Way 4 - SQL `wallpaper` code generated by Macro : Specific to one data set

The macro is simpler because it is specific to the data set in question. For a more general case the salient aspects of the statement generation can be abstracted and further macro-ized (see Way 5)

%macro pivot_across;
  %local i;

  proc sql;
    create table wantAcross2 as
    select units_sold
    %do i = 1 %to 100;  %* codegen many sql select expressions;
    , sum ( case when cost_cat = &i then cost else 0 end ) as cost_sum_&i
    %end;
    from have
    group by units_sold;
  quit;
%mend;

%pivot_across;

Tip: With a few changes the code gen can be Proc SQL pass-through and perform the pivot remotely.

Way 5 - SQL `wallpaper` code generated by Macro : Any data set

Well not quite any data set. This macro in its current form handles id variables that are numeric and whose values are expressible exactly as a perceived numeric literal emitted by cats(). A more robust version would examine the type of the id variable and quote the id values compared to in the generated CASE statements. The most robust version would have a code gen'd CASE statement that checked id values per put(..., RB8.)

%macro sql_transpose (data=, out=, by=, var=, id=, aggregate_function=sum, default=0, prefix=, suffix=);

/*
 * CASE statement codegener will need tweaking to handle character id variables (i.e. QUOTE of the &id)
 * CASE statement codegener will need tweaking to handle numeric id variables that have non-integer values
 * inexpressible as a simple source code numeric literal. (i.e. may need to compare data when formnatted as RB4.);
 */

  %local case_statements;

  proc sql noprint;
    select
    "&aggregate_function ("
    || "CASE when &id = " || cats(idValues.&id) || " then &var else &default end"   
    || ") as &prefix" || cats(idValues.&id) || "&suffix"
    into :case_statements
    separated by ','
    from (select distinct &id from &data) as idValues
    order by &id
    ;

  %*put NOTE: %superq(case_statements);

    create table &out as
    select &by, &case_statements
    from &data
    group by &by;
  quit;

%mend;

%sql_transpose 
( data=have
, out=wantAcross3
, by=units_sold
, id=cost_cat
, var=cost
, prefix=cost_sum_
);

Tip: With a few changes the code gen can be Proc SQL pass-through and perform the pivot remotely. Special attention would need to be made for gathering the data behind the case_statements.

Way 6 - Hash table : Numerically indexed pivot columns

If you are a hashaholic this code might not seem extravagant.

data _null_;
  if 0 then set have(keep=units_sold cost_cat cost); * prep pdv;

  * hash for tracking id values;

  declare hash ids(ordered:'a');
  ids.defineKey('cost_cat');
  ids.defineDone();

  * hash for tracking sums
  * NOTE: The data node has a sum variable instead of using 
  * argument tags suminc: and keysum:  This was done because HITER NEXT() does not 
  * automatically load the keysum value into its PDV variable (meaning
  * another lookup via .SUM() would have to occur in order to obtain it);

  call missing (cost_sum);

  declare hash sums(ordered:'a');
  sums.defineKey('units_sold', 'cost_cat');
  sums.defineData('units_sold', 'cost_cat', 'cost_sum');
  sums.defineDone();

  * scan the data - track the id values and sums for pivoted output;

  do while (not done);
    set have(keep=units_sold cost_cat cost) end=done;

    ids.ref();

    if (0 ne sums.find()) then cost_sum = 0;
    cost_sum + cost;
    sums.replace();
  end;

  * create a dynamic output target;
  * a pool of pdv host variables is required for target;

  array cells cost_sum_1 - cost_sum_10000;
  call missing (of cost_sum_1 - cost_sum_10000);

  * need to iterate across the id values in order to create a 
  * variable name that will be part of the wanted data node;

  declare hiter across('ids');

  declare hash want(ordered:'a');
  want.defineKey('units_sold');
  want.defineData('units_sold');
  do while (across.next() = 0);
    want.defineData(cats('cost_sum_',cost_cat));  * sneaky! ;
  end;
  want.defineDone();

  * populate target;
  * iterate through the sums filling in the PDV variables
  * associated with the dynamically defined data node;

  declare hiter item('sums');
  prior_key1 = .; prior_key2 = .;
  do while (item.next() = 0);
    if units_sold ne prior_key1 then do;
      * when the 'group' changes add an item to the want hash, which will reflect the state of the PDV;
      if prior_key1 ne . then do;
        key1_hold = units_sold;
        units_sold = prior_key1;
        want.add();                  * save 'row' to hash;
        units_sold = key1_hold;
        call missing (of cells(*));
      end;
    end;

    cells[cost_cat] = cost_sum;
    prior_key1 = units_sold;
  end;
  want.add();

  * output target;

  want.output (dataset:'wantAcross4');

  stop;
run;

Verification

Proc COMPARE will show that all the want outputs are the same.

proc compare nomissing 
  noprint data=wantAcross1 compare=wantAcross2 out=diff1_2 outnoequal;
  id units_sold;
run;

proc compare 
  noprint data=wantAcross2 compare=wantAcross3 out=diff2_3 outnoequal;
  id units_sold;
run;

proc compare nomissing 
  noprint data=wantAcross3 compare=wantAcross4 out=diff3_4 outnoequal;
  id units_sold;
run;

Upvotes: 2

J_Lard
J_Lard

Reputation: 1103

As Reeza pointed out, the best way to do this is probably through a combination of proc sql or proc means/summary and proc transpose. I'm assuming you know SQL so I'll go into that description first.

proc sql;
create table tmp as
select UnitsSold, CostCat, sum(cost) as cost
from have
group by UnitsSold, CostCat;
quit;

If you wanted to do this via a SAS procedure you can use proc summary.

proc summary data=have nway missing;
class UnitsSold CostCat;
var Cost;
output out=tmp(drop=_:) sum=;  ** drop=_: removes the automatic variables created in the procedure;
run;

Now that the table is summarized and sorted by UnitsSold and CostCat you can transpose the table.

proc transpose data=tmp out=want(drop=_NAME_) prefix=CostCat;
by UnitsSold;
id CostCat;
var cost;
run;

Upvotes: 0

Related Questions