Eula Carrara
Eula Carrara

Reputation: 13

How to select a percentage of values from a column in SAS?

I have 70 databases of different sizes (same number of columns, different numbers of lines). I need to get the 25% higher values and the 25% lower values considering a given column VAR1.

I have:

id VAR1
1    10
2    -5
3   -12
4     7
5    12
6     7
7    -9
8   -24
9     0
10    6
11  -18
12   22

Sorting by VAR1, I need to select the rows (all columns) containing the 3 smallest and the 3 largest (25% from each extreme), i.e.,

id  VAR1
8   -24
11  -18
3   -12
7   -9
2   -5
9   0
10  6
4   7
6   7
1   10
5   12
12  22

I need to keep in the database the rows (all columns) that contain the VAR1 equal to -24, -18, -12, 10, 12 and 22.

id  VAR1
8   -24
11  -18
3   -12
1   10
5   12
12  22

What I’ve been thinking:

  1. Order column VAR1 in ascending order;
  2. Create a numbered column from 1 to N (n=_N_) - in this case, N=12;
  3. I do a=N*0.25 (to have the value that represents 25%);
  4. I do b=N-a (to have the value that represents the "last" 25%).

So, I can use keep: if N<a.... I will have the first 25% (the smallest). if N>b.... I will have the last 25% (the largest).

I can calculate a and b. But I’m not getting the maximum value of N in this case 12.

I will repeat this for the 70 database, I would not like to have to enter this maximum value every time (it varies from one database to another).

I need help to "fix" the maximum value (N) without having to type it (even if it is repeated in all the lines of another "auxiliary column"). Or if there’s some better way to get those 25% from each end.

My code:

proc sort data=have; by VAR1; run;
data want; set have;
seq=_N_;
N=max(seq); *N=max. value of lines. (I stopped here and don’t know if below is right);
a=N*0.25;
b=N-b;
if N<a;
if N>b;
run;

Thank you very much!

Upvotes: 0

Views: 701

Answers (1)

Richard
Richard

Reputation: 27498

Proc RANK computes percentiles that you can use to select the desired rows.

Example:

data have1 have2 have3 have4 have5;
  do id = 1 to 100;
    X = ceil(rand('normal', 0, 10));
    if id < 60 then output have1;
    if id < 70 then output have2;
    if id < 80 then output have3;
    if id < 90 then output have4;
    if id < 100 then output have5;
  end;
run;

proc rank data=have1 percent out=want1(where=(pct not between 25 and 75)) ;
  var x;
  ranks pct;
run;

proc rank data=have2 percent out=want2(where=(pct not between 25 and 75)) ;
  var x;
  ranks pct;
run;

proc rank data=have3 percent out=want3(where=(pct not between 25 and 75)) ;
  var x;
  ranks pct;
run;

Upvotes: 0

Related Questions