Reputation: 21
I have a data set, call it training, in which I need to predict column Y from column X. The conditional probability distribution of Y given X is stored in a separate lookup table, and it is a Table distribution in SAS terminology. X and Y both take values in {1,2,...,44}, so the probability lookup table is 44 x 44.
Suppose my probability look up table is
x p1 p2 ... p44
1 0.001 0.004 ... 0.0078
2 0.0045 ... 0.000
.. ... ...
44 0.00089 ... 0.00067
The conditional probabilities are highly skewed, and my training data has a large N.
I am looking for an efficient, fast way to match and retrieve the conditional probabilities in order to predict y in the training data. I am considering a hash object to speed up the lookup, and perhaps a sampling method other than RAND('table', of p1-p44) to make the program run faster.
Below is a mock hash program for this task.
data prediction;
  array prob[44] _temporary_;
  if _n_ = 1 then do;
    /* compile-time only: put x and p1-p44 into the PDV so the hash can fill them */
    if 0 then set work.prob_lookup;
    dcl hash lookup(dataset:'work.prob_lookup');
    lookup.definekey('x');
    lookup.definedata('p1',...,'p44'); /* Is there a wildcard method to specify variable p1 to p44? */
    lookup.definedone();
  end;
  set work.training;
  /* fetch p1-p44 for this row's x */
  rc = lookup.find();
  if rc = 0 then do;
    y_predict = rand('table', of p1-p44);
    output;
  end;
run;
The training table has a large N, about 2k+ rows, and the code should also handle more rows when the probability lookup table is applied for prediction. The probability lookup table is 44 x 44, which is relatively small, but highly skewed.
Help building faster code for this task is much appreciated. If there is an easy way to do it in R, that would be appreciated as well.
Upvotes: 0
Views: 85
Reputation: 51566
Not sure what HASH adds to this problem.
data prediction;
  merge training prob_lookup;
  by x;
  y_predict=rand('table',of p1-p44);
run;
I guess it might save having to sort your training set by X? But if your X values are just integers between 1 and 44, then why not just use the POINT= option?
data prediction;
  set training ;
  pointer=x;
  if pointer in (1:44) then do;
    set prob_lookup point=pointer ;
    y_predict=rand('table',of p1-p44);
  end;
  else call missing(y_predict,of p1-p44);
run;
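On the question of a sampling method other than RAND('table'), one possibility is to load the 44 x 44 table into a temporary array once and draw Y by inverting the cumulative distribution by hand. The following is only a sketch under assumptions that are not stated in the question: prob_lookup has exactly the variables x and p1-p44 with x in 1 to 44, each row of probabilities sums to 1, and the helper names cum, row, running, j and u are made up for illustration and do not clash with variables in training. Whether it actually beats RAND('table') would have to be timed on the real data.

```sas
data prediction;
  /* cum[x, j] = P(Y <= j | X = x), kept in memory for the whole step */
  array cum[44, 44] _temporary_;
  array p[44] p1-p44;

  if _n_ = 1 then do;
    call streaminit(12345);              /* arbitrary seed, only for reproducibility */
    /* read the lookup table once; x is renamed so it does not overwrite training's x */
    do until (eof);
      set work.prob_lookup(rename=(x=row)) end=eof;
      running = 0;
      do j = 1 to 44;
        running = running + p[j];
        cum[row, j] = running;
      end;
    end;
  end;

  set work.training;

  if 1 <= x <= 44 and x = int(x) then do;
    u = rand('uniform');
    /* default to the last category in case rounding leaves cum[x, 44] slightly below 1 */
    y_predict = 44;
    do j = 1 to 43;
      if u <= cum[x, j] then do;
        y_predict = j;
        leave;
      end;
    end;
  end;
  else call missing(y_predict);

  drop row running j u p1-p44;
run;
```

Because the probabilities are highly skewed, the linear scan usually stops early when the heavy categories come first; ordering each row's categories by descending probability (and mapping back to the original Y codes) would be one way to exploit that, at the cost of extra bookkeeping.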
Upvotes: 1