Reputation: 121
I am writing a very simple script. All it has to do is read data with pandas and then train a decision tree on it. The data I am using is:
https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data
And the following is my script:
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn import preprocessing
import pandas as pd
balance_data=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
sep= ',', header= None)
#print "Dataset:: "
#df1.head()
X = balance_data.values[:, 0:5]
Y = balance_data.values[:,6]
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.2, random_state = 100)
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
From the error I am guessing that it couldn't convert the "med" attribute value to float. Looking at the data, my guess is that "low" has a space before it and "med" doesn't, and that this is what confuses it, but I am not sure. Please tell me what could be wrong. PS: the error occurs at the last line, and here is the traceback:
ValueError Traceback (most recent call last)
<ipython-input-26-b495e5f26174> in <module>()
18 max_depth=3, min_samples_leaf=5)
19 X_train[X_train != '']
---> 20 clf_gini.fit(X_train, y_train)
/home/fatima/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
788 sample_weight=sample_weight,
789 check_input=check_input,
--> 790 X_idx_sorted=X_idx_sorted)
791 return self
792
/home/fatima/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
114 random_state = check_random_state(self.random_state)
115 if check_input:
--> 116 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
117 y = check_array(y, ensure_2d=False, dtype=None)
118 if issparse(X):
/home/fatima/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
400 force_all_finite)
401 else:
--> 402 array = np.array(array, dtype=dtype, order=order, copy=copy)
403
404 if ensure_2d:
ValueError: could not convert string to float: med
Upvotes: 4
Views: 20304
Reputation: 14689
The dataset looks like this:
0 1 2 3 4 5 6
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
Here all the data types (dtypes) are object. However, scikit-learn estimators such as DecisionTreeClassifier expect numeric input (int or float), so you need to encode your data before you use it for training.
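To see which columns are the problem before fitting, you can inspect the dtypes. The following is only a minimal sketch: `sample` is a hand-typed copy of the five rows shown above (so it runs without downloading the file), and `non_numeric` is a name made up for illustration. Columns 2 and 3 are typed as strings on purpose, because the full file also contains values such as 5more and more in those columns.

```python
import pandas as pd

# Hypothetical inline sample: the first rows shown above, typed in by hand
# so the example runs without downloading the file.
sample = pd.DataFrame(
    [
        ["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
        ["vhigh", "vhigh", "2", "2", "small", "med", "unacc"],
        ["vhigh", "vhigh", "2", "2", "small", "high", "unacc"],
        ["vhigh", "vhigh", "2", "2", "med", "low", "unacc"],
        ["vhigh", "vhigh", "2", "2", "med", "med", "unacc"],
    ]
)

# One dtype per column; none of them is numeric here.
print(sample.dtypes)

# Columns that cannot be handed to the classifier without encoding.
non_numeric = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]
print(non_numeric)
```

Every column ends up in `non_numeric`, which is consistent with the "could not convert string to float" error in the traceback.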
There are several ways to encode your data. One way is label encoding; to use it, add the following lines to your code just after loading the dataset:
le = preprocessing.LabelEncoder()
balance_data = balance_data.apply(le.fit_transform)
Now the data in balance_data looks like this:
0 1 2 3 4 5 6
0 3 3 0 0 2 1 2
1 3 3 0 0 2 2 2
2 3 3 0 0 2 0 2
3 3 3 0 0 1 1 2
4 3 3 0 0 1 2 2
where all data types are int.
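One thing to be aware of: a single LabelEncoder that is refit on every column only remembers the classes of the last column it saw. If you want to inspect or reverse the mapping later, one option is to keep one encoder per column. This is a sketch under assumptions, not part of the fix itself: `sample`, `encoders` and `mapping_5` are illustrative names, and the sample holds only a few hand-typed values from columns 4 to 6.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample with values from columns 4, 5 and 6 of the rows above.
sample = pd.DataFrame({
    4: ["small", "small", "small", "med", "med"],
    5: ["low", "med", "high", "low", "med"],
    6: ["unacc", "unacc", "unacc", "unacc", "unacc"],
})

# Fit a separate encoder for each column and keep it for later use.
encoders = {}
encoded = pd.DataFrame(index=sample.index)
for col in sample.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(sample[col])
    encoders[col] = enc

# classes_ is sorted, and the position of a label is its integer code.
mapping_5 = {label: code for code, label in enumerate(encoders[5].classes_)}
print(mapping_5)

# The stored encoder can turn the integers back into the original labels.
restored = encoders[5].inverse_transform(encoded[5])
print(list(restored))
```

Note that the codes follow the sorted order of the labels present in the data (high, low, med for column 5), not any natural ordering of the categories, and they depend on which values actually occur in the column.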
In general, you need to perform some data preprocessing before training/fitting your model. I recommend going through a tutorial on the subject to understand the process.
Here's the overall code with the fix:
import numpy as np
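# sklearn.cross_validation was removed in scikit-learn 0.20; on older versions (before 0.18) keep the original import.
from sklearn.model_selection import train_test_split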
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn import preprocessing
import pandas as pd
balance_data=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
sep= ',', header= None)
#print "Dataset:: "
#df1.head()
le = preprocessing.LabelEncoder()
balance_data = balance_data.apply(le.fit_transform)
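# Note: 0:5 selects columns 0-4 only; use 0:6 if column 5 should also be a feature.
X = balance_data.values[:, 0:5]
# Column 6 holds the class label.
Y = balance_data.values[:,6]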
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.2, random_state = 100)
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
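Once the model fits, the already imported accuracy_score can be used to check it on the held-out split. The sketch below is self-contained and deliberately does not use the real file: `toy` is a made-up stand-in built from every combination of three categorical features, and the label rule is invented purely so the example runs offline. The accuracy it prints says nothing about the car dataset.

```python
import itertools

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data: all combinations of three categorical features,
# labelled by an invented rule (NOT the real car data).
levels = ["low", "med", "high"]
rows = [
    (a, b, c, "acc" if c != "low" and a != "high" else "unacc")
    for a, b, c in itertools.product(levels, repeat=3)
]
toy = pd.DataFrame(rows * 4)  # repeat the rows so the split has enough samples

# Encode every column with its own fresh LabelEncoder.
encoded = toy.apply(lambda col: LabelEncoder().fit_transform(col))
X = encoded.values[:, 0:3]
y = encoded.values[:, 3]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

clf = DecisionTreeClassifier(
    criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5
)
clf.fit(X_train, y_train)

# Predict on the held-out rows and compare with the true labels.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

With the real data, the same last three lines would apply to clf_gini, X_test and y_test from the script above.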
Upvotes: 9
Reputation: 395
I've checked the file you are trying to process, and I found this is the data:
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc
vhigh,vhigh,2,2,med,low,unacc
vhigh,vhigh,2,2,med,med,unacc
vhigh,vhigh,2,2,med,high,unacc
So when you ask to train the model, it internally tries to convert your data into numbers, but finds string values (such as "small", "med", "high", etc.) which cannot be parsed as numbers.
A good start could be to convert your categorical values to a one-hot encoding. See:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
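As a rough illustration of what that looks like, here is a minimal sketch using OneHotEncoder on a small hand-typed frame. The column names lug_boot and safety and the variable names are assumptions added for readability (the file itself has no header), and only two feature columns are shown.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical two-column sample; the column names are illustrative only.
features = pd.DataFrame({
    "lug_boot": ["small", "small", "med", "med", "big"],
    "safety": ["low", "med", "high", "low", "med"],
})

# handle_unknown="ignore" encodes unseen categories as all zeros at predict time.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; toarray() makes it dense.
one_hot = encoder.fit_transform(features).toarray()

# One list of categories per input column, each in sorted order.
print(encoder.categories_)
print(one_hot.shape)
```

Each input column becomes one 0/1 column per category, so no artificial ordering between categories is introduced. The resulting array can be passed to fit in place of the raw string features.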
Upvotes: 0