Reputation: 443
I am a complete newbie in ML with scikit-learn. I just want this script to work, after spending a lot of time learning what ML is, its types, and so on.
from sklearn import tree
import pandas as pd
import numpy as np
df = pd.read_csv('test.csv')
age = df.Age.to_list()
age = np.array(age).reshape(-1,1)
inc = df.Income.to_list()
inc = np.array(inc).reshape(-1,1)
stud = df.Student.to_list()
stud = np.array(stud).reshape(-1,1)
buy = df.Buy.to_list()
buy = np.array(buy).reshape(-1,1)
X = [age,inc,stud]
y = [[buy]]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
'''
Income:
1 - high
2 - medium
3 - low
Student:
1 - yes
2 - no
'''
age = 34
inc = 1
stud = 2
pred = clf.predict(age,ince,stud)
print(pred)
But I get this error:
Traceback (most recent call last):
  File "D:\Huzefa\Desktop\ML.py", line 23, in <module>
    clf = clf.fit(X, y)
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\tree_classes.py", line 894, in fit
    X_idx_sorted=X_idx_sorted)
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\tree_classes.py", line 158, in fit
    check_y_params))
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\base.py", line 429, in _validate_data
    X = check_array(X, **check_X_params)
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
    return f(**kwargs)
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 642, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.
If I could just correct my script to make it work, I would be motivated to continue further with ML. All help is greatly appreciated!
Upvotes: 0
Views: 574
Reputation: 66
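The way you're defining your X and y seems overcomplicated to me. Is there a specific reason behind that choice? Putting three reshaped (n, 1) arrays into a list gives scikit-learn a 3-dimensional input, which is what the ValueError is complaining about. You could instead do the following: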
X = df[["Age","Income","Student"]]
y = df.Buy
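To make the difference concrete, here is a minimal sketch of the whole script with that change applied. The small DataFrame is a made-up stand-in for test.csv (the real rows aren't shown in the question), and the 0/1 encoding of Buy is an assumption. It also shows that predict expects a 2-D input with one row per sample, not three separate arguments:

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Hypothetical stand-in for test.csv; the values and the 0/1 Buy encoding are assumptions.
df = pd.DataFrame({
    "Age":     [25, 34, 45, 52, 23, 40],
    "Income":  [1, 2, 3, 1, 3, 2],   # 1 - high, 2 - medium, 3 - low
    "Student": [1, 2, 2, 1, 1, 2],   # 1 - yes, 2 - no
    "Buy":     [1, 0, 0, 1, 1, 0],
})

# What the question effectively built: three (n, 1) arrays stacked in a list,
# which scikit-learn sees as a 3-D array.
bad_X = np.array([df[c].to_numpy().reshape(-1, 1) for c in ["Age", "Income", "Student"]])
print(bad_X.shape)  # three dimensions: (features, samples, 1)

# What the estimator expects: X with shape (n_samples, n_features), y with shape (n_samples,).
X = df[["Age", "Income", "Student"]]
y = df["Buy"]
print(X.shape, y.shape)

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# predict takes a single 2-D argument: one row per sample, columns in the training order.
new_sample = pd.DataFrame([[34, 1, 2]], columns=["Age", "Income", "Student"])
pred = clf.predict(new_sample)
print(pred)
```

With your real file, you would replace the hand-written DataFrame with pd.read_csv('test.csv') and keep the rest as it is.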
Also, by doing
clf = clf.fit(X, y)
you're training your decision tree on all the data available. If this is a training dataset and you have a test dataset stored elsewhere, that's okay; if not, you need to split the data first, so that you can both train the model and check how well it performs on data it has not seen. train_test_split (from sklearn.model_selection) is a useful function for this.
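A rough sketch of that split, again on made-up data standing in for test.csv (the rows, the 0/1 Buy encoding, the 25% test size and the random_state values are all assumptions chosen for illustration):

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data; replace with pd.read_csv('test.csv') for the real file.
df = pd.DataFrame({
    "Age":     [25, 34, 45, 52, 23, 40, 31, 60, 28, 37, 49, 22],
    "Income":  [1, 2, 3, 1, 3, 2, 1, 3, 2, 1, 2, 3],
    "Student": [1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 1],
    "Buy":     [1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
})

X = df[["Age", "Income", "Student"]]
y = df["Buy"]

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training rows only.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Score on the held-out rows the model never saw during training.
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}, accuracy: {acc:.2f}")
```

On a dataset this small the accuracy figure will swing a lot between splits, so treat it as a sanity check rather than a reliable estimate.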
Upvotes: 1