Reputation: 420
I have a dataset unique across 5 variables. The 5th is an identifier variable. Finally, I have a 6th variable, which is dependent on the identifier variable.
The identifier variable can appear in multiple places. The dependent variable will never change for a given value of the identifier variable.
I have code such as the following:
proc sql;
select
...
, count(distinct identifier) as n_ids
from
group by
Which selects the number of unique identifiers per group of 4 independent variables. I'm hoping to add on to this the sum of the 6th variable, which would be something like the following:
sum(case when distinct identifier then dependent_var else 0 end)
Which obviously does not work (and for good reason). Any clean way of finding this sum within the sql step?
Upvotes: 0
Views: 404
Reputation: 63424
The easiest solution is probably to summarize the dataset first by the identifier:
proc sql;
select biggerstuff, identifier, max(depvar)
from yourdataset
group by biggerstuff,identifier;
quit;
Then insert that in your larger query in the place of the 'from' statement (select blah, count(identifier), sum(depvar)). Once you've pre-summarized it in the inner query you know that you only get 1 row per identifier so distinct isn't needed any longer.
Upvotes: 1