Classification using R in a data set with numeric and categorical variables

I'm working with a very large data set (a CSV file).

The data set contains both numeric and categorical columns.

One of the columns is my "target column": I want to use the other columns to predict which value (out of 3 known possible values) is likely to be in the target column. In the end I want to check my classification against the real data.

My question:

I'm using R.

I am trying to find a way to select the subset of features that gives the best classification. Going over all the subsets is impossible.

Does anyone know an algorithm, or can anyone think of a way to do it in R?

Upvotes: 0

Answers (2)

Heidi Xiao

Reputation: 1

Since you have both numerical and categorical data, you may want to try an SVM.

I am using SVM and KNN on my numerical data, and I also tried a DNN. Training a DNN is pretty slow in R, especially on big data. KNN needs no training step, but it is meant for numerical data. The following is what I am using for the SVM; maybe you can have a look at it.

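# Train the model
library(e1071)                                # provides svm()
y_train <- as.factor(data[, 1])               # first column is the response variable
x_train <- subset(data, select = -1)          # all remaining columns are predictors
train_df <- data.frame(x_train, y = y_train)  # keep the original predictor names
svm_model <- svm(y ~ ., data = train_df, type = "C-classification")

# Test
y_test <- testdata[, 1]
x_test <- subset(testdata, select = -1)       # same columns as x_train
pred <- predict(svm_model, newdata = x_test)
svm_t <- table(pred, y_test)                  # confusion matrix
sum(diag(svm_t)) / sum(svm_t)                 # accuracy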
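
If you want to try KNN on mixed data as well, one common workaround is to expand the categorical columns into 0/1 dummy columns first. The following is only a rough sketch under assumptions: the data frame toy and its column names are made up for illustration, k = 5 is an arbitrary choice, and knn() comes from the class package that ships with standard R installations.

```r
library(class)  # knn(); a recommended package bundled with R

set.seed(1)
# Hypothetical mixed data: two numeric predictors, one categorical, 3-class target
toy <- data.frame(
  target = factor(sample(c("a", "b", "c"), 60, replace = TRUE)),
  num1   = rnorm(60),
  num2   = runif(60),
  colour = factor(sample(c("blue", "red", "yellow"), 60, replace = TRUE))
)

# Expand factors into 0/1 dummy columns and drop the intercept column
x <- model.matrix(target ~ ., data = toy)[, -1]
x <- scale(x)  # put all columns on a comparable scale before computing distances

train_idx <- sample(nrow(toy), 40)  # 40 rows for training, 20 for testing
knn_pred <- knn(train = x[train_idx, ], test = x[-train_idx, ],
                cl = toy$target[train_idx], k = 5)
knn_acc <- mean(knn_pred == toy$target[-train_idx])  # holdout accuracy
```

Scaling on the full data, as done here for brevity, leaks a little information from the test rows; on real data you would compute the scaling from the training rows only.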

Upvotes: 0

Rwak

Reputation: 326

This is a classification problem. Without knowing how many covariates you have, I can't be sure, but wouldn't a neural network solve your problem?

You could use the nnet package, which fits a feed-forward neural network and works with multiple classes. Having categorical columns is not a problem, since you can just use factors.

Without a data sample I can only explain it briefly, but the main function is:

newNet<-nnet(targetColumn~ . ,data=yourDataset, subset=yourDataSubset [..and more values]..)

You obtain a trained neural net. What is also important here is the size of the hidden layer (the size argument), which is tricky to get right. As a rule of thumb, it should be roughly 2/3 of the number of inputs plus the number of outputs (3 in your case).
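
As a small worked example of that rule of thumb, here is a sketch that reads it as "2/3 times the inputs, plus the outputs" and rounds the result. The built-in iris data stands in for your data set, and the variable names are only illustrative. Factor predictors are expanded into dummy columns by nnet, so the inputs are counted from the model matrix rather than from the raw columns.

```r
data(iris)

# Number of input units: columns of the model matrix, minus the intercept
n_inputs <- ncol(model.matrix(Species ~ ., data = iris)) - 1

# Number of output units: one per class of the target
n_outputs <- nlevels(iris$Species)

# Rule of thumb: about 2/3 of the inputs plus the outputs
hidden_size <- round(2 / 3 * n_inputs + n_outputs)
hidden_size
```

Treat the result only as a starting point; trying a few values of size and comparing holdout accuracy is usually more reliable.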

Then with:

myPrediction <- predict(newNet, newdata = yourDataset[-yourDataSubset, ])  # rows not used for training, assuming yourDataSubset holds row indices

You obtain the predicted values. To evaluate them I use the ROCR package, but it currently supports only binary classification; a quick web search should turn up options for the multi-class case.
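
For three classes, a plain confusion matrix built with base R's table() is often enough to get started. This is a minimal sketch; the label vectors actual and predicted are made up for illustration.

```r
# Hypothetical true and predicted labels for a 3-class problem
actual    <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"))
predicted <- factor(c("a", "a", "b", "b", "b", "c", "c", "c", "c", "a"),
                    levels = levels(actual))  # same levels keeps the table square

# Rows are predictions, columns are the true classes
conf_mat <- table(predicted = predicted, actual = actual)

# Overall accuracy: correct predictions sit on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-class recall: correct predictions divided by the true count of each class
recall <- diag(conf_mat) / colSums(conf_mat)
```

Forcing the predicted labels onto the same factor levels as the true labels matters: if the model never predicts one class, the table would otherwise lose a row and diag() would no longer line up.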

If you are adamant about eliminating some of the covariates, the cor() function may help you identify the least informative ones. Note that it works on numeric columns only.
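
One possible way to use cor() for this is to look for pairs of numeric columns that are almost duplicates of each other, and keep only one column of each pair. This is only a sketch: iris stands in for your data, and the 0.9 cutoff is an arbitrary assumption.

```r
data(iris)

# cor() needs numeric input, so pick the numeric columns first
num_cols <- sapply(iris, is.numeric)
cm <- cor(iris[, num_cols])

# Keep each pair once by blanking the diagonal and the upper triangle
cm[upper.tri(cm, diag = TRUE)] <- NA

# Pairs whose absolute correlation exceeds the (arbitrary) cutoff
high <- which(abs(cm) > 0.9, arr.ind = TRUE)
redundant_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
redundant_pairs
```

This only detects redundancy among numeric predictors. It says nothing about how useful a column is for predicting the target, and it ignores the categorical columns.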

Edit for a step by step guide:

Let's say we have this data frame:

str(df)
'data.frame':   5 obs. of  3 variables:
 $ a: num  1 2 3 4 5
 $ b: num  1 1.5 2 2.5 3
 $ c: Factor w/ 3 levels "blue","red","yellow": 2 2 1 2 3

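The column c has 3 levels, that is, 3 distinct values it can take. Before R 4.0.0, data.frame() converted string columns to factors like this by default; from R 4.0.0 on, stringsAsFactors defaults to FALSE, so you may need to convert such columns yourself with factor().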
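
If your string columns were read in as plain character vectors, a short conversion step gives you factors again. The sketch below rebuilds the example data frame with c as a character column and then converts every character column; the helper name char_cols is just illustrative.

```r
# The example data frame, with c stored as plain strings
df <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(1, 1.5, 2, 2.5, 3),
  c = c("red", "red", "blue", "red", "yellow"),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor
char_cols <- sapply(df, is.character)
df[char_cols] <- lapply(df[char_cols], factor)

str(df)  # c is now a factor with 3 levels
```

When reading a CSV you can get the same effect in one step with read.csv(..., stringsAsFactors = TRUE).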

Now, using the columns a and b, we want to predict which value c is going to take, using a neural network. The nnet package is simple enough for this example. If you don't have it installed, use:

install.packages("nnet")

Then, to load it:

require(nnet)

After this, let's train the neural network on a sample of the data. For that, the line

portion<-sample(1:nrow(df),0.7*nrow(df))

stores in portion the indices of a random 70% of the rows of the data frame. Now, let's train that net! I recommend checking the documentation of the nnet package with ?nnet for a deeper understanding. Using only the basics:

myNet<-nnet( c~ a+b,data=df,subset=portion,size=1)

c ~ a + b is the formula for the prediction: you want to predict the column c using the columns a and b.

data= is the data source, in this case the data frame df.

subset= is the set of rows to train on, here the indices stored in portion.

size= is the size of the hidden layer; as I said, use about 2/3 of the total input columns (a and b) plus the total outputs (1).

We have a trained net now, so let's use it.

With predict you apply the trained net to new values.

newPredictedValues<-predict(myNet,newdata=df[-portion,])

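After that, newPredictedValues will hold the predictions. By default these are the raw network outputs (one column of class probabilities per level of c); pass type="class" to predict to get the predicted labels instead.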
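
The five-row data frame above is too small to train anything meaningful, so here is a sketch of the same steps on the built-in iris data, where Species plays the role of c. The seed, size = 3 and the variable names are arbitrary choices for illustration.

```r
library(nnet)  # a recommended package bundled with R

set.seed(42)
data(iris)

# Indices of a random 70% of the rows, used for training
n_train <- round(0.7 * nrow(iris))
portion <- sample(1:nrow(iris), n_train)

# Train on the sampled rows only; trace = FALSE silences the iteration log
irisNet <- nnet(Species ~ ., data = iris, subset = portion,
                size = 3, trace = FALSE)

# Default output: a matrix with one column of probabilities per class
probs <- predict(irisNet, newdata = iris[-portion, ])

# type = "class" returns the predicted labels directly
predictedLabels <- predict(irisNet, newdata = iris[-portion, ], type = "class")

# Share of held-out rows that were classified correctly
holdoutAccuracy <- mean(predictedLabels == iris$Species[-portion])
```

The exact accuracy depends on the random split and the random starting weights, so rerunning with a different seed will give a somewhat different number.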

Upvotes: 2
