Reputation: 544
I'm working on a very large data set (CSV).
The data set contains both numeric and categorical columns.
One of the columns is my "target column": I want to use the other columns to determine which value (out of 3 possible known values) is likely to be in the target column. In the end I will check my classification against the real data.
My question:
I'm using R.
I am trying to find a way to select the subset of features that gives the best classification. Going over all the subsets is impossible.
Does anyone know an algorithm, or can think of a way to do it in R?
Upvotes: 0
Views: 12322
Reputation: 1
Since you have both numerical and categorical data, you could try an SVM.
I am using SVM and KNN on my numerical data, and I also tried a DNN. Training a DNN in R is pretty slow, especially on big data. KNN does not need to be trained, but it is meant for numerical data. The following is what I am using; maybe you can have a look at it.
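library(e1071) # provides svm()
# Train the model
y_train <- as.factor(data[, 1]) # first column is the response variable
x_train <- data[, -1]
# Do not name x_train here: data.frame(x = x_train) would prefix every column with "x." and break predict()
train_df <- data.frame(x_train, y = y_train)
svm_model <- svm(y ~ ., data = train_df, type = "C-classification")
# Test
y_test <- testdata[, 1]
x_test <- testdata[, -1]
pred <- predict(svm_model, newdata = x_test)
svm_t <- table(pred, y_test) # confusion matrix: predictions vs. true labels
sum(diag(svm_t)) / sum(svm_t) # accuracy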
Upvotes: 0
Reputation: 326
This seems to be a classification problem. Without knowing how many covariates you have for your target I can't be sure, but wouldn't a neural network solve your problem?
You could use the nnet package, which implements a feed-forward neural network and works with multiple classes. Categorical columns are not a problem, since you can just use factors.
Without a data sample I can only explain it briefly, but the main function is:
newNet<-nnet(targetColumn~ . ,data=yourDataset, subset=yourDataSubset [..and more values]..)
This gives you a trained neural net. The size of the hidden layer is also important, and it is tricky to get right. As a rule of thumb it should be roughly 2/3 of the number of inputs plus the number of outputs (3 in your case).
Then with:
myPrediction <- predict(newNet, newdata=yourDataset(with the other subset))
This gives you the predicted values. To evaluate them I use the ROCR package, but it currently only supports binary classification; a web search should turn up alternatives for the multi-class case.
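For three classes, one simple option that needs no extra package is a confusion matrix built with table(), the same idea the SVM answer uses for its accuracy. This is only a sketch: the labels below are made up for illustration and are not from the asker's data.

```r
# Hypothetical true and predicted labels for a 3-class problem (made-up data).
actual    <- factor(c("blue", "red", "red", "yellow", "blue", "red"),
                    levels = c("blue", "red", "yellow"))
predicted <- factor(c("blue", "red", "blue", "yellow", "blue", "red"),
                    levels = c("blue", "red", "yellow"))

# Rows are predictions, columns are the true classes.
conf_mat <- table(predicted, actual)

# Overall accuracy: correctly classified cases over all cases.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class recall: correct predictions for a class over its true count.
recall <- diag(conf_mat) / colSums(conf_mat)

print(conf_mat)
print(accuracy)
print(recall)
```

Giving both factors the same levels keeps the table square, so diag() lines up with the classes even when one class is never predicted.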
If you are adamant about eliminating some of the covariates, the cor() function may help you identify the less characteristic ones. Note that cor() only works on numeric columns.
Edit for a step by step guide:
Let's say we have this dataframe:
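One possible way to use cor() for this, sketched on R's built-in iris data rather than the asker's data set: flag pairs of numeric predictors that are strongly correlated with each other, since one column of such a pair probably adds little. The 0.9 cut-off is an arbitrary assumption, and this is my reading of the suggestion, not necessarily what was meant.

```r
# Use only the numeric columns; cor() cannot handle factors.
numeric_cols <- iris[, sapply(iris, is.numeric)]
cor_mat <- cor(numeric_cols)

# Keep each pair once (upper triangle) and flag strong correlations.
threshold <- 0.9  # arbitrary cut-off, chosen for illustration only
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA
high <- which(abs(cor_mat) > threshold, arr.ind = TRUE)

# Collect the flagged pairs together with their correlation.
strong_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(strong_pairs)
```

This only looks at redundancy among numeric predictors; it says nothing about the categorical columns or about how well a column predicts the target.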
str(df)
'data.frame': 5 obs. of 3 variables:
$ a: num 1 2 3 4 5
$ b: num 1 1.5 2 2.5 3
$ c: Factor w/ 3 levels "blue","red","yellow": 2 2 1 2 3
The column c has 3 levels, that is, 3 distinct values it can take. Before R 4.0.0, data.frame() turned string columns into factors like this by default; from R 4.0.0 on you have to ask for it with stringsAsFactors = TRUE or convert the column with factor().
Now, using the columns a and b, we want to predict which value c is going to be, using a neural network. The nnet package is simple enough for this example. If you don't have it installed, use:
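As a sketch, a data frame matching the str() output above could be built like this. The values are read off that output; passing stringsAsFactors = TRUE explicitly is my addition so the string column becomes a factor on any R version.

```r
# stringsAsFactors = TRUE converts the character column c to a factor.
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(1, 1.5, 2, 2.5, 3),
  c = c("red", "red", "blue", "red", "yellow"),
  stringsAsFactors = TRUE
)

str(df)       # structure of the data frame
levels(df$c)  # the distinct values the factor can take
```

The levels are sorted alphabetically, which is why "red" is stored internally as 2 and "blue" as 1.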
install.packages("nnet")
Then, to load it:
require(nnet)
After this, let's train the neural network on a sample of the data. For that, the call
portion<-sample(1:nrow(df),0.7*nrow(df))
will store in portion the indices of a random 70% of the rows of the dataframe. Now, let's train that net! I recommend checking the documentation for the nnet package with ?nnet
for a deeper understanding. Using only the basics:
myNet<-nnet( c~ a+b,data=df,subset=portion,size=1)
c~ a+b
is the formula for the prediction: you want to predict the column c using the columns a and b
data=
is the data source, in this case the dataframe df
subset=
the rows to train on, in this case the 70% sample stored in portion
size=
the number of units in the hidden layer; as I said, use about 2/3 of the total input columns (a and b) plus the total outputs (the one target column, c)
We have a trained net now, so let's use it.
Using predict
you apply the trained net to new values.
newPredictedValues<-predict(myNet,newdata=df[-portion,])
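After that, newPredictedValues will hold the predictions for the rows that were not used for training. By default predict() on an nnet model returns the raw network outputs (one column of class probabilities per level of c); pass type="class" to get the predicted labels instead.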
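Putting the steps together, here is a minimal end-to-end sketch. It uses R's built-in iris data instead of the 5-row example so that the held-out set is large enough to evaluate; the seed and size = 3 are arbitrary assumptions, not tuned values.

```r
library(nnet)  # recommended package that ships with standard R distributions

set.seed(42)  # make the random split and the initial weights reproducible

# Train on a random 70% of the rows, hold out the rest.
portion <- sample(seq_len(nrow(iris)), round(0.7 * nrow(iris)))

# size = 3 is an arbitrary choice for illustration; trace = FALSE silences the log.
myNet <- nnet(Species ~ ., data = iris, subset = portion, size = 3, trace = FALSE)

# type = "class" returns labels instead of the default probability matrix.
test_rows <- iris[-portion, ]
predicted <- predict(myNet, newdata = test_rows, type = "class")

# Confusion matrix of predicted vs. true classes on the held-out rows.
conf_mat <- table(predicted = predicted, actual = test_rows$Species)
print(conf_mat)

# Share of held-out rows classified correctly.
accuracy <- mean(predicted == test_rows$Species)
print(accuracy)
```

The result depends on the random split and starting weights, so it is worth repeating with a few seeds or hidden-layer sizes before trusting a single accuracy figure.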
Upvotes: 2