Reputation: 165
I have a data set with one grouping character variable x and one numeric variable y. For each value of x I want to select value of y that occurs the most, but for each value of x I want only one value of y (doesn't matter which one). I know how to select all such y for each x, but I don't know how to select only one.
proc sql;
select x,y
from (select x, y, count(*) as n_y
from lab7.a
group by x,y)
group by x
having n_y=max(n_y)
;
quit;
The code above gives me all such y for each x, but I would like to have only one y for each x.
I tried using order by n_y in a subquery and then select top 1, but apparently it doesn't work in sas. I also tried monotonic, but it adds number to rows to the whole query, it doesn't work within a group.
For example if I had data:
x y
A 1
A 2
A 2
B 1
B 2
C 3
C 3
my code will return
x y
A 2
B 1
B 2
C 3
but I would like
x y
A 2
B 1
C 3
or
x y
A 2
B 2
C 3
Upvotes: 2
Views: 2324
Reputation: 63424
Since SAS doesn't support the windowing functions or TOP X, I don't think it's technically doable in a single query. You can probably use MONOTONIC() in conjunction with outer query, though:
select x,y from (
select x,y, monotonic() as m
from (select x, y, count(*) as n_y
from have
group by x,y)
group by x
having n_y=max(n_y)
)
group by x
having m=max(m)
;
Or
proc sql;
select x,max(y) from (
select x,y
from (select x, y, count(*) as n_y
from have
group by x,y)
group by x
having n_y=max(n_y)
)
group by x
;
quit;
That of course gives you a biased result - if you're okay with that, then that's fine.
If you're not, what I would do is take that SQL output which has two, then use PROC SURVEYSELECT
to pick one at random (with strata x
and n=1
).
Upvotes: 3
Reputation: 63424
The better, non-SQL way, to do this is to use PROC UNIVARIATE
. The only restriction here is that it doesn't seem to work if every value is a mode (i.e., in your initial example, B has only 2 values, and both are modes). But as long as you have enough values to make this question make any sense, this should work.
proc univariate data=have;
by x;
var y;
output out=want mode=mode_y;
run;
That gives you the mode, and probably does so much faster than the SQL
query. It does require sorting by x
, though, if it's not already. It gives you the lowest mode by default.
Upvotes: 2
Reputation: 51566
Why use PROC SQL
?
proc freq data=have ;
tables x*y / noprint out=counts ;
run;
proc sort data=counts;
by x descending count;
run;
data want ;
set counts ;
by x ;
if first.x ;
run;
Upvotes: 1