Loncar
Loncar

Reputation: 127

SAS: binning data

data scores;
    length variables $ 16;
    input variables $ low high score;
    datalines;
Debt -10000 1 55
Debt  1 10000 23
MAX_NA -1 1 500
MAX_NA 1 100 -240
;

data main_data;
    input ID Debt MAX_NA;
    datalines;
    222554 7584 12 
    212552 20 0 
    883123 500 7 
    913464 -200 -78  
;


data end_result;
    input ID Debt MAX_NA score;
    datalines;
    222554 7584 12 -217
    212552 20 0 523
    883123 500 7 -185 
    913464 -200 -78 555
;

Above you'll find three data sets.

  1. The scores data sets depict each variables' score, based on a range of values between low and high columns.

  2. The second data set main_data shows the exact values of Debt and MAX_NA.

  3. end_result table is what I would like to achieve.

What step and statements should I use to calculate the score and get the end_result table?

Upvotes: 1

Views: 235

Answers (3)

PeterClemmensen
PeterClemmensen

Reputation: 4937

Simpler:

data end_result(keep=ID Debt MAX_NA score);
   set main_data;
   score = 0;
   do i = 1 to n;
      set scores(rename=score=s) point=i nobs=n;
          if      variables = "Debt" and low <= Debt <= high then score + s;
          else if variables = "MAX_NA" and low <= MAX_NA <= high then score + s;
   end;
run;

Upvotes: 2

Jeppe R&#248;mer Juul
Jeppe R&#248;mer Juul

Reputation: 324

Another apprach is to use a double left join like so:

data scores;
    length variables $ 16;
    input variables $ low high score;
    datalines;
Debt -10000 1 55
Debt  1 10000 23
MAX_NA -1 1 500
MAX_NA 1 100 -240
;

data main_data;
    input ID Debt MAX_NA;
    sortseq = _n_;
    datalines;
    222554 7584 12 
    212552 20 0 
    883123 500 7 
    913464 -200 -78  
;


proc sql;
   create table end_result as 
      select a.ID 
            ,a.Debt
            ,a.MAX_NA
            ,coalesce(b.score,0) + coalesce(c.score,0) as score
      from main_data as a
      left join scores(where=(variables="Debt")) as b
         on b.low < a.Debt <= b.high
      left join scores(where=(variables="MAX_NA")) as c
         on c.low < a.MAX_NA <= c.high
      order by a.sortseq
   ;
quit;

Note that I have included a sortseq variable in main_data to keep the sorting order. Like draycut I get the same score for id 222554 and 883123. For ID 913464 the MAX_NA value is out of range of the scores dataset, so I have counted it as zero by using the coalesce funtion. I therefore get the results:

ID     Debt  MAX_NA score 
222554 7584  12     -217 
212552 20    0       523 
883123 500   7      -217 
913464 -200 -78      55 

Upvotes: 2

PeterClemmensen
PeterClemmensen

Reputation: 4937

I don't understand why id 222554 and 883123 do not get the same score?

Anyway, here is an approach you can use as a template.

data end_result;

   if _N_ = 1 then do;
      dcl hash h(dataset : "scores(rename=score=s)", multidata : "Y");
      h.definekey("variables");
      h.definedata(all : "Y");
      h.definedone();
      dcl hiter hi("h");
   end;

   set main_data;
   if 0 then set scores(rename=score=s);
   score = 0;

   do while (hi.next() = 0);
      if      variables = "Debt" and low <= Debt <= high then score + s;
      else if variables = "MAX_NA" and low <= MAX_NA <= high then score + s;
   end;

   keep id Debt max_na score;

run;

Result:

ID     Debt  MAX_NA score 
222554 7584  12     -217 
212552 20    0       523 
883123 500   7      -217 
913464 -200 -78      555 

Upvotes: 1

Related Questions