rafalpw
rafalpw

Reputation: 165

How to select the most frequent value within a group (not all but only one)

I have a data set with one grouping character variable x and one numeric variable y. For each value of x I want to select value of y that occurs the most, but for each value of x I want only one value of y (doesn't matter which one). I know how to select all such y for each x, but I don't know how to select only one.

proc sql;
    select x,y
    from (select x, y, count(*) as n_y
            from lab7.a
            group by x,y)
    group by x
    having n_y=max(n_y)
    ;
quit;

The code above gives me all such y for each x, but I would like to have only one y for each x.

I tried using order by n_y in a subquery and then select top 1, but apparently it doesn't work in sas. I also tried monotonic, but it adds number to rows to the whole query, it doesn't work within a group.

For example if I had data:

x y
A 1
A 2
A 2
B 1
B 2
C 3
C 3

my code will return

x y
A 2
B 1
B 2
C 3

but I would like

x y
A 2
B 1
C 3

or

x y
A 2
B 2
C 3

Upvotes: 2

Views: 2324

Answers (4)

Victor
Victor

Reputation: 17097

Why not:

proc sql;
select x,max(y) from table group by x;
quit;

Upvotes: 0

Joe
Joe

Reputation: 63424

Since SAS doesn't support the windowing functions or TOP X, I don't think it's technically doable in a single query. You can probably use MONOTONIC() in conjunction with outer query, though:

  select x,y from (
    select x,y, monotonic() as m
    from (select x, y, count(*) as n_y
            from have
            group by x,y)
    group by x
    having n_y=max(n_y)
    )
    group by x
    having m=max(m)
;

Or

 proc sql;
  select x,max(y) from (
    select x,y
    from (select x, y, count(*) as n_y
            from have
            group by x,y)
    group by x
    having n_y=max(n_y)
    )
    group by x
;
quit;

That of course gives you a biased result - if you're okay with that, then that's fine.

If you're not, what I would do is take that SQL output which has two, then use PROC SURVEYSELECT to pick one at random (with strata x and n=1).

Upvotes: 3

Joe
Joe

Reputation: 63424

The better, non-SQL way, to do this is to use PROC UNIVARIATE. The only restriction here is that it doesn't seem to work if every value is a mode (i.e., in your initial example, B has only 2 values, and both are modes). But as long as you have enough values to make this question make any sense, this should work.

proc univariate data=have;
  by x;
  var y;
  output out=want mode=mode_y;
run;

That gives you the mode, and probably does so much faster than the SQL query. It does require sorting by x, though, if it's not already. It gives you the lowest mode by default.

Upvotes: 2

Tom
Tom

Reputation: 51566

Why use PROC SQL?

 proc freq data=have ;
   tables x*y / noprint out=counts ;
 run;
 proc sort data=counts;
   by x descending count;
 run;
 data want ;
   set counts ;
   by x ;
   if first.x ;
 run;

Upvotes: 1

Related Questions