Reputation: 380
I am pretty new to machine learning in general and scikit-learn in specific.
I am trying to use the example given on the site http://scikit-learn.org/stable/tutorial/basic/tutorial.html
For practicing on my own, I am using my own data-set. My data set is divided into two different CSV files:
Train_data.csv (Contains 32 columns, the last column is the output value).
Test_data.csv (Contains 31 columns the output column is missing - Which should be the case, no?)
Test data is one column less than training data..
I am using the following code to learn (using training data) and then predict (using test data).
The issue I am facing is the error:
*ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time*
Here is my code (sorry if it looks completely wrong :( )
import pandas as pd #import the library
from sklearn import svm
mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"] #my csv has header row, and the output label column is named "Desired"
data = mydata.ix[:,:-3] #select all but the last column as data
clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target) #Code from the URL above
test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column
clf.predict(test_data[-1:]) #Code from the URL above
The training data csv labels looks something like this:
Value1,Value2,Value3,Value4,Output
The test data csv labels looks something like this:
Value1,Value2,Value3,Value4.
Thanks :)
Upvotes: 1
Views: 4212
Reputation: 1645
Your problem is a Supervised Problem, you have some data in form of (input,output).
The input are the features describing your example and the output is the prediction that your model should respond given that input.
In your training data, you'll have one more attribute in your csv file because in order to train your model you need to give him the output.
The general workflow in sklearn with a Supervised Problem should look like this
X, Y = read_data(data)
n = len(X)
X_train, X_test = X[:n*0.8], X[n*0.8:]
Y_train, Y_test = Y[:n*0.8], Y[n*0.8:]
model.fit(X_train,Y_train)
model.score(X_test, Y_test)
To split your data, you can use train_test_split and you can use several metrics in order to judge your model's performance.
You should check the shape of your data
data.shape
It seems like you're not taking into the account the last 3 columns instead of only the last. Try instead :
data = mydata.ix[:,:-1]
Upvotes: 1