Reputation: 1952
I have some data in R with various variables for my cases:
B T H G S Z
Golf 1 1 1 0 1 0
Football 0 0 0 1 1 0
Hockey 1 0 0 1 0 0
Golf2 1 1 1 1 1 0
Snooker 1 0 1 0 1 1
I also have a vector of my expected output per case:
1, 2, 3, 1, 4
What I would like to do is identify variables that are not useful. In this example B and Z offer little ability to classify the data so I would like to be told that fact.
I looked at using multiple linear regression; however, I don't want to separately type in and manipulate every variable/dimension, as in my real data they run into the thousands, with tens of thousands of cases.
Any help on the best approach would be greatly appreciated.
Btw I'm not a statistician, I'm a software developer, so excuse me if the terminology isn't correct.
Upvotes: 4
Views: 5285
Reputation: 179478
You have asked quite a broad question, but I will try to be as precise as I can. A note of caution first: every statistical analysis method carries implicit assumptions. If you rely on the results of a statistical model without understanding the limitations of the analysis, you can quite easily draw the wrong conclusion.
It is also not quite clear to me what you mean by classification. If somebody asked me to do a classification analysis, I would probably consider things like cluster analysis, factor analysis or latent class analysis. There are also some variants of linear regression modelling that could be applicable.
That said, here is how you should go about doing a linear regression using your data.
First, replicate your sample data:
dat <- structure(list(B = c(1L, 0L, 1L, 1L, 1L), T = c(1L, 0L, 0L, 1L,
0L), H = c(1L, 0L, 0L, 1L, 1L), G = c(0L, 1L, 1L, 1L, 0L), S = c(1L,
1L, 0L, 1L, 1L), Z = c(0L, 0L, 0L, 0L, 1L)), .Names = c("B",
"T", "H", "G", "S", "Z"), class = "data.frame", row.names = c("Golf",
"Football", "Hockey", "Golf2", "Snooker"))
dat
B T H G S Z
Golf 1 1 1 0 1 0
Football 0 0 0 1 1 0
Hockey 1 0 0 1 0 0
Golf2 1 1 1 1 1 0
Snooker 1 0 1 0 1 1
Next, add the expected values:
dat$expected <- c(1,2,3,1,4)
dat
B T H G S Z expected
Golf 1 1 1 0 1 0 1
Football 0 0 0 1 1 0 2
Hockey 1 0 0 1 0 0 3
Golf2 1 1 1 1 1 0 1
Snooker 1 0 1 0 1 1 4
Finally, we can start the analysis. Fortunately, lm
has a shortcut mechanism to tell it to use all of the columns in your data frame. To do this, use the formula expected ~ . :
fit <- lm(expected~., dat)
summary(fit)
Call:
lm(formula = expected ~ ., data = dat)
Residuals:
ALL 5 residuals are 0: no residual degrees of freedom!
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.00e+00 NA NA NA
B 1.00e+00 NA NA NA
T -3.00e+00 NA NA NA
H 1.00e+00 NA NA NA
G -4.71e-16 NA NA NA
S NA NA NA NA
Z NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 4 and 0 DF, p-value: NA
And a last word of caution. Since your sample data contains fewer rows than columns, the linear regression model has insufficient data to estimate all the coefficients, so it simply discarded the last two columns (S and Z, reported as NA). Your brief description of your data indicates that you have far more rows than columns, so this ought not to be a problem for you.
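Incidentally, those NA coefficients give you a programmatic handle on your original question of flagging unhelpful variables, so you don't have to read the summary by eye. A minimal sketch, reusing the sample data from above (with thousands of columns you would screen the same vector of coefficients):

```r
# Reconstruct the sample data from above
dat <- data.frame(
  B = c(1, 0, 1, 1, 1),
  T = c(1, 0, 0, 1, 0),
  H = c(1, 0, 0, 1, 1),
  G = c(0, 1, 1, 1, 0),
  S = c(1, 1, 0, 1, 1),
  Z = c(0, 0, 0, 0, 1),
  row.names = c("Golf", "Football", "Hockey", "Golf2", "Snooker")
)
dat$expected <- c(1, 2, 3, 1, 4)

fit <- lm(expected ~ ., dat)

# Predictors that lm could not estimate (dropped for collinearity /
# rank deficiency) come back as NA coefficients
dropped <- names(coef(fit))[is.na(coef(fit))]
dropped
```

With enough rows per column, you could extend this to screen the remaining predictors by p-value via summary(fit)$coefficients, but note that with thousands of predictors, a plain p-value cutoff will flag many variables by chance alone.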
Upvotes: 5
Reputation: 81
There are a lot of different approaches to consider. One basic starting point would be principal component regression (http://rss.acs.unt.edu/Rdoc/library/pls/html/svdpc.fit.html is one example). There are lots of open questions: what distributions you expect, whether these variables are always boolean, or whether they represent something like age or enumerated values for demographic slices.
https://stats.stackexchange.com/ has lots of experts for these kinds of questions.
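To give a feel for the principal-component idea, here is a sketch using base R's prcomp on made-up boolean data (the linked pls package does the full regression step); the variance summary hints at how many effective dimensions the predictors really have:

```r
set.seed(1)

# Hypothetical data: 200 cases, 10 boolean predictors
X <- matrix(rbinom(200 * 10, 1, 0.5), nrow = 200)

# Principal component analysis of the predictor matrix
pc <- prcomp(X)

# Proportion of variance explained per component; components with
# near-zero variance suggest redundant predictor dimensions
summary(pc)
```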
Upvotes: 2