user159193

Reputation: 21

Advice on faster SAS code

I have a data set, call it training, in which I need to predict column Y from X. The conditional probability distribution of Y given X is stored in a separate lookup table; it is a Table distribution in SAS terminology. X and Y both take values in {1,2,...44}, so the probability lookup table is 44 x 44.

Suppose my probability lookup table is

x  p1     p2               ...                p44  
1  0.001  0.004            ...                0.0078
2  0.0045                  ...                0.000
..                         ...                ... 
44 0.00089                 ...                0.00067

The conditional probabilities are highly skewed, and my training data has a large N.

I am looking for an efficient, fast way to match each training row to its conditional probabilities and draw a prediction of y. I am considering a hash object to speed up the lookup, and maybe some other sampling method besides RAND('table', of p1-p44), to possibly make the program run faster.

Below is a mock hash program for this task.

data prediction;
array prob[44] _temporary_;

if _n_ = 1 then do;
if 0 then set work.prob_lookup; /* defines x and p1-p44 at compile time */
dcl hash lookup(dataset:'work.prob_lookup');
lookup.definekey('x');
lookup.definedata('p1',...,'p44'); /* Is there a wildcard method to specify variables p1 to p44? */
lookup.definedone();
end;

set work.training;
rc = lookup.find();
if rc = 0 then do;
    y_predict = rand('table', of p1-p44);
output;
end;
run;

The training table has a large N, about 2k+ rows, and the code should also scale to more rows when the probability lookup table is applied for prediction. The probability lookup table is 44 x 44, which is relatively small, but highly skewed.

Help building faster code for this task is much appreciated. If there is an easy way to do it in R, that would be appreciated as well.

Upvotes: 0

Views: 85

Answers (1)

Tom

Reputation: 51566

Not sure what HASH adds to this problem.

data prediction;
   merge training prob_lookup;
   by x;
   y_predict=rand('table',of p1-p44);
run;
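
The MERGE step above relies on BY-group processing, so both data sets need to be sorted by x first. A minimal sketch of that preparation, assuming the data sets are named training and prob_lookup as above; the IN= flag name in_train and the seed value are placeholders introduced here for illustration:

```sas
/* Sort both inputs by the BY variable before merging.
   Note: this reorders the data sets in place. */
proc sort data=training;    by x; run;
proc sort data=prob_lookup; by x; run;

data prediction;
   call streaminit(12345);        /* placeholder seed, for reproducible draws */
   merge training(in=in_train) prob_lookup;
   by x;
   if in_train;                   /* keep only rows that come from training */
   y_predict = rand('table', of p1-p44);
run;
```

The subsetting IF is there so that lookup rows whose x never appears in training do not end up in the output.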

I guess a hash might save having to sort your training set by X? But if your X values are just integers between 1 and 44, then why not just use the POINT= option? This assumes that row number i of prob_lookup holds the probabilities for x = i.

data prediction;
   set training ;
   pointer=x;
   if pointer in (1:44) then do;
     set prob_lookup point=pointer ;
     y_predict=rand('table',of p1-p44);
   end;
   else call missing(y_predict,of p1-p44);
run;
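
On the wildcard question in the original post: the hash object's DEFINEDATA method accepts an ALL: 'yes' argument, which registers every variable of the data set named in the DATASET: argument as data. A minimal sketch of the hash version, assuming prob_lookup contains only x and p1-p44; the seed value is a placeholder:

```sas
data prediction;
   if _n_ = 1 then do;
      call streaminit(12345);            /* placeholder seed */
      if 0 then set work.prob_lookup;    /* define x and p1-p44 at compile time */
      declare hash lookup(dataset: 'work.prob_lookup');
      lookup.defineKey('x');
      lookup.defineData(all: 'yes');     /* all variables of prob_lookup become data */
      lookup.defineDone();
   end;

   set work.training;
   if lookup.find() = 0 then
      y_predict = rand('table', of p1-p44);
   else
      call missing(y_predict, of p1-p44); /* no match: clear retained values */
run;
```

Unlike the POINT= version, this does not depend on the row order of prob_lookup. Whether it is actually faster would need to be timed on the real data.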

Upvotes: 1
