Oliver
Oliver

Reputation: 194

SAS: How to count distinct at multiple levels

Is there an elegant way in SAS to calculate a "count distinct" at multiple levels without using a series of sql statements?

Suppose we are counting unique customers. Joe has bought 2 large widgets and 1 small widget, and it is simple enough to count him as 1 customer for large widgets and 1 customer for small widgets. But we also need to count him as only 1 "widget customer", so it requires a separate SQL calculation. And it turns out Joe has also bought a mouse pad, so when we count unique customers for the Miscellaneous Products group we need a third SQL calculation. Joe bought all this stuff from store #1, but when we count the unique customers of miscellaneous products for the Small Store Number Region we also have to account for Joe's purchase of a coffee cup from store #2. Another SQL calculation.

There are many ways in SAS to count things, but it seems there is not a simple way to count "distinct" things. Any suggestions?

Upvotes: 2

Views: 2900

Answers (3)

Robert Penridge
Robert Penridge

Reputation: 8513

Reaaally short on time so I'll dump the code and run... but I think this should do the trick. I'm using the sashelp.cars dataset for some sample data. I'm going to say that the combination of make+model is the 'distinct' value that I want to measure in as many different ways as possible.

Create sample data. Flag is just a numeric field that we're going to use to use in the next step. Note that in this data it's important that if you have fields a,b,c,make+model then that combination of values should already be unique.

data test;
  length key $50;
  set sashelp.cars;
  key = cats(make,model);
  flag = 1;
  keep make type origin key flag ;
run;

Use proc summary to generate all combinations that we're interested in. Note the where clause limits the rows we keep to just those rows where make+model was included in the considered fields. Try commenting out the where statement to see what I mean.

proc summary data=test noprint missing;
  class make type origin key;
  var flag;
  output out=smry(where=(mod(_type_,2) eq 1)) sum=;
run;

Simply summarise the resulting table and voila!

proc sql noprint;
  create table all_combos as
  select make,
         type,
         origin,
         sum(flag) as distinct_keys
  from smry
  group by 1,2,3,4
  order by _type_
  ;
quit;

I'll come back to this and explain in more detail how it works. If someone else has the time right now then feel free to edit this post ;-)

EDIT : Just read JJFord's answer and this is basically the implementation of that.

Upvotes: 2

Joe
Joe

Reputation: 63434

You've got a few different problems here, and it depends on the structure of the data. Happily you didn't provide that, so I get to decide!

What we're going to assume is you have a dataset like this. You also could have the two LargeWidget purchases in two rows (and similar for other products), that wouldn't have any effect on this.

data customers;
  length product $20;
  input name $ product $ store $ count;
datalines;
Joe SmallWidget Store1 1
Joe LargeWidget Store1 2
Joe Mousepad Store1 1
Joe TeaPitcher Store2 1
Jack SmallWidget Store2 1
Jack Mousepad Store1 1
Jane LargeWidget Store2 1
Jane Mousepad Store1 1
Jill LargeWidget Store3 1
;;;;
run;

Then we need to do one more setup: determine how you intend to group things. Multilabel formats let you ask one value to fall into multiple buckets (ie, Large Widgets and all Widgets both). This is one example, you can do many different things here; you can also do this from a dataset (look about for CNTLIN option on proc format, or ask another question).

proc format;
  value $prodgroup (multilabel default=20)
    SmallWidget = "Small Widgets"
    SmallWidget = Widgets
    LargeWidget = "Large Widgets"
    LargeWidget = Widgets
    Mousepad = Mousepads
    Mousepad = "Misc Products"
    TeaPitcher = "Tea Pitchers"
    TeaPitcher = "Misc Products"
  ;
  value $storeRegions (multilabel)
    Store1=Store1
    Store1=West
    Store2=Store2
    Store2=East
    Store3=Store3
    Store3=East
  ;
quit;

Then we face the big problem: SAS sucks at calculating 'distinct' things. It just wasn't really made to do that. I don't know why; tabulate really should, but it doesn't. However, it does a really good job of putting things in buckets, and that's really the hard part here; the actual distinct counting bit can be done in a second pass (which, regardless of your initial data size, will be really fast, assuming you have many fewer kinds of products than you have data points).

proc tabulate data=customers out=custdata;
  class product store /mlf preloadfmt order=data;  
  class name;
  format product $prodgroup20.;
  format store $storeRegions6.;
  tables (all store product store*product),name*n;
run;

proc tabulate data=custdata;
  class product store/missing;
  tables (product*store), n;
run;

The MLF PRELOADFMT ORDER=DATA makes the multilabel bit work properly and makes it come out in a useful order (in my opinion). You can add NOTSORTED to the value statement in PROC FORMAT to get even more control, though the second PROC TABULATE will just mess things up in terms of order so I wouldn't bother.

In this case what we do is first produce a table that is by customer, then that table gets output to a dataset; that dataset now is guaranteed to contain one line per customer per [whatever grouping], as long as the groupings are mirrored in both procs (minus the NAME class of course!).

The one deficiency of this process is that it doesn't produce lines for those products with 0to do that you would need to not have the second tabulate, and instead do it in SQL or a data step. For example:

proc tabulate data=customers out=custdata2;
  class product store /mlf preloadfmt order=data;  
  class name;
  format product $prodgroup20.;
  format store $storeRegions6.;
  tables (all store product store*product),name*n/printmiss;
run;

proc sql;
  select product, store, sum(case when n>0 then 1 else 0 end) as count
  from custdata2
  group by product,store;
quit;

Note the one change to the TABLE line: printmiss, which instructs Tabulate to make one line for every possible combination no matter what. Then SQL does the rest of the work. Of course this SQL could also replicate the second Tabulate statement earlier if you prefer SQL.

Upvotes: 4

JJFord3
JJFord3

Reputation: 1985

If I'm interpreting it right, it sounds like proc means with a class statement may give you what you want.

Add store (1 vs 2), product (large widget, small widget, coffee, & mouse pad), product rollup (widget vs non-widget) as classes and look at the frequency of occurrence.

If Joe's coffee shows up multiple times for the same purchase, you may need to pre-process the data with a single SQL select distinct statement.

I am reaching right now though as I'm guessing at the structure of the data. If you add a small example of the data, I'm sure we can get to a correct answer.

Upvotes: 3

Related Questions