Getting value of row that corresponds to distinct? (proc sql)

Question

I have a dataset unique across 5 variables. The 5th is an identifier variable. Finally, I have a 6th variable, which is dependent on the identifier variable.

The identifier variable can appear in multiple places. The dependent variable will never change for a given value of the identifier variable.

I have code such as the following:

proc sql;
select
...
, count(distinct identifier) as n_ids
from
group by

Which selects the number of unique identifiers per group of 4 independent variables. I'm hoping to add on to this the sum of the 6th variable, which would be something like the following:

sum(case when distinct identifier then dependent_var else 0 end)

Which obviously does not work (and for good reason). Any clean way of finding this sum within the sql step?

Joe · Accepted Answer

The easiest solution is probably to summarize the dataset first by the identifier:

proc sql;
select biggerstuff, identifier, max(depvar) 
from yourdataset
group by biggerstuff,identifier;
quit;

Then insert that in your larger query in the place of the 'from' statement (select blah, count(identifier), sum(depvar)). Once you've pre-summarized it in the inner query you know that you only get 1 row per identifier so distinct isn't needed any longer.

Getting value of row that corresponds to distinct? (proc sql)

Answers (1)

Related Questions