Reputation: 1039
I have a small dataset consisting of three distinct observations on each of three variables, say x1 x2, x3 and the accompanying response y, on which I would like to perform Analysis of Variance to test whether the means are equal.
data anova;
input var obs resp;
cards;
1 1 1.1
1 2 .5
1 3 -2.1
2 1 4.2
2 2 3.7
2 3 .8
3 1 3.2
3 2 2.8
3 3 6.3
;
proc anova data=anova;
class var;
model resp=var;
run;
All good so far. Now however, I would like to use a permutation test to check the p-value of the F-statistic. The test is quite simple and it consists of randomly reassigning the 9 observations to the 3 variables so that each variable has three observations, and calculating the F-statistic each time. The p-value would then be the proportion of these statistics that are higher than 4.39, the value of the F-test from the code above.
Normally I would do it by hand but there is 1680 possible combinations( 9!/(3!3!3!) )here so it would take me a week. Is there a more elegant way to accomplish this? Perhaps a way to wrap this procedure in a loop or a function?
I would be grateful for your help, thank you.
Upvotes: 2
Views: 287
Reputation: 1511
This is my approach. It employs the allperm
routine of SAS to populate all permutations. Then in order to eliminate duplicates, I use the product of the numbers in a group as the key. The result is 1680.
After this, you can use by
keyword of proc glm
to run based on the group indicator group
in the final data set.
proc transpose data=anova(keep=resp) out=anova1;
run;
quit;
data anova1;
set anova1;
n = fact(9);
array rs (*) col1-col9;
do i=1 to n;
call allperm(i, of rs(*));
a = rs(1)*rs(2)*rs(3);
b = rs(4)*rs(5)*rs(6);
c = rs(7)*rs(8)*rs(9);
file resperms notitles;
put rs(*) a b c;
end;
run;
data perms;
infile resperms;
input x1-x3 y1-y3 z1-z3 a b c;
run;
proc sort data=perms nodupkey;
by a b c;
run;
data perms;
set perms;
group = _N_;
drop a b c;
run;
proc transpose data=perms out=perms;
by group;
run;
quit;
data perms;
set perms;
var = substr(_NAME_,1,1);
obs = substr(_NAME_,2,1)*1;
rename col1=resp;
drop _NAME_;
run;
Upvotes: 1
Reputation: 3310
Resampling like this can be done with a few steps. First, following an example from SAS help, use PROC SQL to create a table with all the possible combinations of var and resp.
PROC SQL;
CREATE TABLE possibilities AS SELECT a.var, b.resp FROM
anova a CROSS JOIN anova b;
QUIT;
Then, resample from all the possible combinations of var and resp using PROC SURVEYSELECT.
PROC SURVEYSELECT
DATA = possibilities
OUT = permutations
METHOD = URS /* URS means unrestricted sampling, or sampling uniformly with
replacement */
SAMPSIZE = 9 /* Generate samples that are the same size as your original
data */
REP = 1000; /* Repeat 1000 times */
STRATA var / ALLOC = PROP; /* Make sure that we sample evenly from each of the
3 values of var */
RUN;
Then, using PROC GLM with a BY statement, calculate the F statistic for each replicate.
/* Data must be sorted to use a BY statement */
PROC SORT
DATA = permutations;
BY replicate;
RUN;
PROC GLM
DATA = permutations
NOPRINT
OUTSTAT = f_statistics;
CLASS var;
MODEL resp = var;
BY replicate;
WEIGHT numberhits;
QUIT;
Finally, in a data step, create a dummy variable indicating whether each F statistic is greater than 4.39, the test statistic from the example, and then use the MEANS procedure to get the fraction of times that was true. The final answer should be approximately 0.068, as in the original example.
DATA final;
SET f_statistics;
IF f ne .; /* Drop rows of the GLM output that don't contain an F statistic */
reject_null = (f > 4.39);
RUN;
PROC MEANS
DATA = final
MEAN;
VAR reject_null;
RUN;
Upvotes: 2