Advice on faster SAS code

Question

I have a data set, call it training, in which I need to predict column Y from column X. The conditional probability distribution of Y given X is stored in a separate lookup table; it is a Table distribution in SAS terminology. X and Y both take values in {1, 2, ..., 44}, so the probability lookup table is 44 x 44.

Suppose my probability lookup table is:

x  p1     p2               ...                p44  
1  0.001  0.004            ...                0.0078
2  0.0045                  ...                0.000
..                         ...                ... 
44 0.00089                 ...                0.00067

The conditional probabilities are highly skewed, and my training data has a large N.

I am looking for an efficient, fast way to match each training row to its conditional probabilities and draw a prediction of Y from them. I am considering a hash object to speed up the lookup, and possibly a sampling method other than RAND('table', of p1-p44) to make the program run faster.

Below is a mock hash program for this task.

data prediction;
/* Put x and p1-p44 into the PDV so the hash object has host variables; this SET never executes. */
if 0 then set work.prob_lookup;

if _n_ = 1 then do;
    dcl hash lookup(dataset: 'work.prob_lookup');
    lookup.definekey('x');
    lookup.definedata('p1',...,'p44'); /* Is there a wildcard method to specify variables p1 to p44? */
    lookup.definedone();
end;

set work.training;
rc = lookup.find();
if rc = 0 then do;
    /* Draw y from the Table distribution given by the matched row */
    y_predict = rand('table', of p1-p44);
    output;
end;
run;
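
A minimal sketch of how the mock program above might be made runnable, assuming work.prob_lookup holds a numeric key x plus p1-p44 and work.training holds x. The DEFINEDATA calls are issued in a loop, so the 44 names do not have to be typed out; DEFINEDATA(ALL: 'YES') is a documented alternative that registers every variable of the loaded data set. The seed 12345 and the helper variable _i are placeholders, not part of the original program.

```sas
data prediction;
    /* Host variables for the hash object; this SET is never executed */
    if 0 then set work.prob_lookup;

    if _n_ = 1 then do;
        dcl hash lookup(dataset: 'work.prob_lookup');
        lookup.definekey('x');
        /* Register p1..p44 as data items, one name per call */
        do _i = 1 to 44;
            lookup.definedata(cats('p', _i));
        end;
        lookup.definedone();
        call streaminit(12345);   /* placeholder seed for reproducible draws */
    end;

    set work.training;

    /* find() returns 0 when x is in the lookup table and fills p1-p44 */
    if lookup.find() = 0 then do;
        y_predict = rand('table', of p1-p44);
        output;
    end;

    drop _i;
run;
```

Rows whose x has no match in the lookup table are not written to the output, as in the mock program. Adding p1-p44 to the DROP statement would keep the 44 probability columns out of the result.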

The training table currently has about 2k+ rows, but the program should also scale to more rows when the probability lookup table is applied for prediction. The probability lookup table is 44 x 44, so it is relatively small, but highly skewed.

Help building faster code for this task is much appreciated. If there is an easy way to do it in R, that would be appreciated as well.
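
Because the lookup table is only 44 x 44, one possible alternative to a hash object is to load it once into a temporary array of cumulative probabilities and sample by inverse CDF from a single uniform draw. The sketch below assumes x is an integer from 1 to 44 in both tables and that each row of p1-p44 sums to about 1; the seed and the underscore-prefixed helper variables are placeholders. Whether this actually beats RAND('table', ...) would need to be timed on the real data.

```sas
data prediction;
    /* cum[i, j] = P(Y <= j | X = i), filled once from work.prob_lookup */
    array cum[44, 44] _temporary_;
    array p[44] p1-p44;

    if _n_ = 1 then do;
        do while (not done);
            set work.prob_lookup end=done;
            _run = 0;
            do _j = 1 to 44;
                _run + p[_j];          /* sum statement treats a missing p as 0 */
                cum[x, _j] = _run;
            end;
        end;
        call streaminit(12345);        /* placeholder seed */
    end;

    set work.training;

    /* Nested IFs so cum[] is never indexed with an out-of-range x */
    if 1 <= x <= 44 then if cum[x, 44] > 0 then do;
        _u = rand('uniform');
        /* Stop at the first category whose cumulative probability reaches _u;
           category 44 absorbs any rounding shortfall in the row total */
        do y_predict = 1 to 43 while (_u > cum[x, y_predict]);
        end;
        output;
    end;

    drop _run _j _u done p1-p44;
run;
```

With highly skewed rows, a linear scan like this tends to exit early when the high-probability categories come first, so reordering the columns by descending probability (and mapping the index back to the original Y value) might reduce the work per row further.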
