BitteB
BitteB

Reputation: 59

Identify most frequent number across variables and largest number of the most frequent (SAS)

I am working on an assignment where I need to identify the most frequent number across a range of variables. If there is a tie between two number, I also need SAS to return the highest value of the two most frequent number.

Using this answer (https://communities.sas.com/t5/General-SAS-Programming/Find-most-frequent-response-across-multiple-variables/td-p/269774), I know how to identify the most frequent number if there isn't a tie between two number. I now only need SAS to return the highest number if there is a tie. I think the problem arises in the last line before the 'run'-statement.

data have;
input id 1 x1 $ 4-5 x2 $ 7-8 x3 $ 10-11 x4 $ 13-14 x5 $ 16-17;
cards;
1 07 04 07 07 07
2 04 05 04 04 05
3 02 02    03
4 02 01 02 01
5 01 02 03    04
;
run;

data want;
set have;
length MostFreq $2;
array x x:;
array _t[10] _temporary_;
call missing(of _t[*]);
do _n_=1 to dim(x);
  if x[_n_] ne ' ' then _t[input(x[_n_],2.)]+1;
end;
Count=max(of _t[*]);
MostFreq=whichn(Count, of _t[*]);
run;

Upvotes: 1

Views: 213

Answers (3)

whymath
whymath

Reputation: 1394

Well, I have to say, The code you give is a very excellent example for whichn usage. And I will also praise the usage of array _t, really nice thought!

For the question itself, here is my answer.

data have;
input id 1 x1 $ 4-5 x2 $ 7-8 x3 $ 10-11 x4 $ 13-14 x5 $ 16-17;
cards;
1 07 04 07 07 07
2 04 05 04 04 05
3 02 02    03
4 02 01 02 01
5 01 02 03    04
;
run;

data want;
    set have;

    array x x1-x5;
    array y y1-y5;
    do i = 1 to dim(x);
        y[i] = count(catx('@',of x[*]),cats(x[i]));
    end;
    count = max(of y[*]);

    do i = 1 to dim(x);
        if y[i] = count then highest = highest <> input(x[i],best.);
    end;
    drop y: i;
run;

The assignment of y[i] assumed that x1 to x5 are between 0 and 9. If not, there should be some more restrict:

y[i] = count('@'||catx('@',of x[*]),catx('@','',x[i]));

Upvotes: 1

Richard
Richard

Reputation: 27508

WhichN will return the index of only the first (left to right) occurrence, so when there is a MODE tie you will not get the highest.

You can compute highest mode and count of mode at frequency bin update time.


data have;
input id (x1-x5) ($CHAR2. +1);
cards;
1 07 04 07 07 07
2 04 05 04 04 05
3 02 02    03
4 02 01 02 01
5 01 02 03    04
;

data want;
  set have;

  label 
    hmode_n = 'Mode (count)'
    hmode = 'Mode (highest)'
  ;

  array x x1-x5;
  array bins[00:99] _temporary_;  * freq table for two digit numbers;

  do index = 1 to dim(x);
    if missing(x[index]) then continue;
    value = input(x[index],2.);
    bins[value] + 1;
    if bins[value] > hmode_n then do; 
      hmode_n = bins[value];
      hmode = value;
    end;
    else
    if bins[value] = hmode_n and value > hmode then do;
      hmode = value;
    end;
  end;

  call missing(of bins(*));

  drop index value;
run;

Upvotes: 2

Stu Sztukowski
Stu Sztukowski

Reputation: 12884

First, adjust your input dataset so that all values are numeric rather than character:

data have;
    input id 1 x1 x2 x3 x4 x5;
    datalines;
1 07 04 07 07 07
2 04 05 04 04 05
3 02 02 . 03 .
4 02 01 02 01
5 01 02 03 .04
;
run;

Next, transpose the data by id. This will make it easier to work with. Once it is in a long format, you can more easily feed the data into procs to handle the calculations for you.

proc transpose data=have
               out=have2(rename=(col1 = value))
               name=var;
    by id;
    var x1-x5;
run;

proc rank can allow you to grab what you need.

proc rank data=have2 
          out=want 
          ties=high 
    ;

    by id;
    var value;
    ranks rank;
run;

proc sort data=want;
    by id rank;
run;

enter image description here

proc univariate is also an option to get your statistics of interest.

proc univariate data=have2;
    by id;
    id var;
    var value;
run;

enter image description here

Upvotes: 1

Related Questions