I have a small dataset based on a survey (about 80 observations) on which I want to perform a logistic regression using SAS.
My survey contains some variables (named X1, X2, X3) that I want to combine as categories of a newly created variable named X4.
The problem is that the variables X1-X3 already have categories of their own (YES/NO/WITHOUT OPINION).
How can I combine them as categories of X4 while still taking into account the values they hold?
To help you understand my question:
Y(=1/0) = X1 X2 X3
X1-X3 each have 3 categories (YES/NO/WITHOUT OPINION)
What I want is:
Proc logistic data = have ; model Y = X4 and others such as age, city... but X4 can take 3 values.
The problem isn't creating X4 from X1-X3, but how to assign to X4 the values that each of X1-X3 takes.
(NB: I say X1-X3, but there are more variables than that.)
I am doing this in SAS, but even a theoretical explanation would be helpful!
Thank you.
Upvotes: 0
Views: 156
Reputation: 63424
I think that the comments are right for the most part - this probably won't help your regression.
But to answer how to literally do this: the usual approach is to use powers of 2 (or 3).
So, for a typical "yes/no" where you don't care about the third option, you'd assign things like this (with each variable coded 0/1):
x4 = (x1) + (x2 * 2) + (x3 * 4);
Then the values would be like this:
0 = (0,0,0)
1 = (1,0,0)
2 = (0,1,0)
3 = (1,1,0)
4 = (0,0,1)
5 = (1,0,1)
6 = (0,1,1)
7 = (1,1,1)
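The mapping above can be reproduced with a short DATA step. This is only a minimal sketch: the dataset names `have` and `want`, the `id` column, and the five sample rows are made up for illustration, and it assumes x1-x3 are already numeric with 0 = NO and 1 = YES.

```sas
/* Hypothetical sample data: x1-x3 are assumed to be coded 0 = NO, 1 = YES */
data have;
  input id x1 x2 x3;
  datalines;
1 0 0 0
2 1 0 0
3 1 1 0
4 0 1 1
5 1 1 1
;
run;

data want;
  set have;
  /* Each variable gets its own binary digit, so every
     combination of answers maps to a unique value from 0 to 7 */
  x4 = x1 + 2*x2 + 4*x3;
run;

/* For the five rows above, x4 comes out as 0, 1, 3, 6, 7 */
proc print data=want noobs;
run;
```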
If you actually want "no opinion" to be a category, then you would do this with powers of 3. (This is complicated: in many cases it's not ideal to include people with "no opinion" unless having an opinion is actually relevant, and it's often better to exclude them or to impute the value.) It works the same way as the powers of 2; you just have a lot more category combinations (27 in total for three variables).
x4 = (x1) + (x2 * 3) + (x3 * 9);
Just make sure they're 0/1/2 coded, not 1/2/3; if they're 1/2/3, subtract one from each variable before multiplying, e.g. x4 = (x1 - 1) + ((x2 - 1) * 3) + ((x3 - 1) * 9);
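If the survey answers are stored as text rather than numbers, the recoding and the base-3 encoding can be done in one pass over an array, which also extends to more than three variables. The following is a sketch under assumptions: the dataset names, the `id` column, the sample rows, and the choice of NO = 0, YES = 1, WITHOUT OPINION = 2 are all illustrative, not taken from the question's actual data.

```sas
/* Hypothetical sample data with character answers */
data have;
  length x1-x3 $15;
  infile datalines dsd;
  input id y x1 $ x2 $ x3 $;
  datalines;
1,1,YES,NO,WITHOUT OPINION
2,0,NO,NO,NO
3,1,YES,YES,YES
4,0,WITHOUT OPINION,YES,NO
;
run;

data want;
  set have;
  array src{*} x1-x3;          /* widen the list here if there are more variables */
  x4 = 0;
  do i = 1 to dim(src);
    /* Assumed coding: NO = 0, YES = 1, WITHOUT OPINION = 2 */
    select (upcase(strip(src{i})));
      when ('NO')              code = 0;
      when ('YES')             code = 1;
      when ('WITHOUT OPINION') code = 2;
      otherwise                code = .;  /* unexpected answer: x4 becomes missing */
    end;
    /* Variable i contributes the base-3 digit in position i-1 */
    x4 = x4 + code * 3**(i-1);
  end;
  drop i code;
run;

/* Check how many respondents fall into each combined level;
   for the four rows above, x4 is 19, 0, 13 and 5 */
proc freq data=want;
  tables x4;
run;
```

To use the result in the regression, x4 would need to be listed in the CLASS statement of PROC LOGISTIC so that it is treated as a categorical variable rather than a number. The frequency table is worth checking first, since many of the 27 possible levels may be empty or nearly empty in a sample of about 80.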
What else can you do that's better? In theory there are a number of approaches that are superior to this kind of categorization, which really doesn't help with overfitting at all: with about 80 observations, 27 possible categories leaves very few respondents in each one.
One term that's helpful is "collapsing"; see for example this paper by Bruce Lund et al. (Plug: Bruce is giving a (not free) class in regression for WUSS later this month.) You can use ANOVA to analyze which variables contribute to your model. You can use some other procedures like GLMSELECT as well; this is a major topic in regression in general.
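As a rough illustration of the GLMSELECT suggestion, the sketch below runs a stepwise selection on simulated data. Everything about the data is an assumption made for the example (the dataset name `have`, the 0/1/2 coding, the `age` variable, and the random values). Note also that GLMSELECT fits linear models, so with a binary Y it is at best a screening step; PROC LOGISTIC has its own SELECTION= option on the MODEL statement that may be more appropriate for the final model.

```sas
/* Simulated stand-in for the survey: 80 respondents, purely illustrative */
data have;
  call streaminit(2024);
  do id = 1 to 80;
    x1  = floor(3 * rand('uniform'));   /* assumed coding 0/1/2 */
    x2  = floor(3 * rand('uniform'));
    x3  = floor(3 * rand('uniform'));
    age = 20 + floor(50 * rand('uniform'));
    y   = rand('bernoulli', 0.5);
    output;
  end;
run;

/* Keep x1-x3 as separate categorical effects and let the
   procedure decide which of them are worth retaining */
proc glmselect data=have;
  class x1 x2 x3;
  model y = x1 x2 x3 age / selection=stepwise(select=sbc);
run;
```

Keeping X1-X3 as separate CLASS effects costs two parameters per variable, compared with up to 26 for a single combined X4, which is one reason selection on the original variables tends to be easier on a small sample.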
You could also look into factor analysis, like in this SAS Book excerpt.
Upvotes: 1