huzefausama

Reputation: 443

Numpy Error "ValueError: Found array with dim 3. Estimator expected <= 2."

I am a complete newbie to ML with scikit-learn. I just want this to work after all the time I spent learning what ML is, its types, and so on.


from sklearn import tree
import pandas as pd
import numpy as np

df = pd.read_csv('test.csv')

age = df.Age.to_list()
age = np.array(age).reshape(-1,1)

inc = df.Income.to_list()
inc = np.array(inc).reshape(-1,1)

stud = df.Student.to_list()
stud = np.array(stud).reshape(-1,1)

buy = df.Buy.to_list()
buy = np.array(buy).reshape(-1,1)

X = [age,inc,stud]
y = [[buy]]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
'''
Income:
1 - high
2 - medium
3 - low

Student:
1 - yes
2 - no

'''
age = 34
inc = 1
stud = 2


pred = clf.predict(age,ince,stud)

print(pred)

But I get this error:

Traceback (most recent call last):
  File "D:\Huzefa\Desktop\ML.py", line 23, in <module>
    clf = clf.fit(X, y)
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\tree_classes.py", line 894, in fit
    X_idx_sorted=X_idx_sorted)
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\tree_classes.py", line 158, in fit
    check_y_params))
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\base.py", line 429, in _validate_data
    X = check_array(X, **check_X_params)
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
    return f(**kwargs)
  File "C:\Users\Huzefa\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 642, in check_array
    % (array.ndim, estimator_name))
ValueError: Found array with dim 3. Estimator expected <= 2.

If I could just correct my script to make it work, I would be motivated to continue further with ML. All help is greatly appreciated!

Upvotes: 0

Views: 574

Answers (1)

el123456789

Reputation: 66

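The way you're defining your X and y seems overcomplicated to me. Is there a specific reason behind that choice? Your X is a list of three (n, 1) arrays, which scikit-learn converts into a 3-D array of shape (3, n, 1), and that is what triggers the error; the estimator expects a 2-D X with one row per sample and one column per feature. You could select the columns directly instead: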

X = df[["Age","Income","Student"]]
y = df.Buy
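A minimal sketch of the whole script with this change, assuming the CSV columns are named Age, Income, Student and Buy as in the question. The small DataFrame is made-up stand-in data, since test.csv isn't shown. It first reproduces the 3-D shape that the original X had, then fits on the 2-D selection. Note that predict also expects a 2-D input (one row per sample), not three separate arguments.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Hypothetical stand-in for pd.read_csv('test.csv'); all values are invented.
df = pd.DataFrame({
    "Age":     [25, 34, 45, 52, 23, 40],
    "Income":  [1, 2, 3, 1, 3, 2],   # 1 = high, 2 = medium, 3 = low
    "Student": [1, 2, 2, 1, 1, 2],   # 1 = yes, 2 = no
    "Buy":     [1, 0, 0, 1, 1, 0],
})

# What the question's code builds: a list of three (n, 1) column arrays.
cols = [df[c].to_numpy().reshape(-1, 1) for c in ["Age", "Income", "Student"]]
bad_shape = np.array(cols).shape
print(bad_shape)  # three dimensions, which the estimator rejects

# The fix: a 2-D feature table and a 1-D target.
X = df[["Age", "Income", "Student"]]  # shape (n_samples, 3)
y = df["Buy"]                         # shape (n_samples,)

clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X, y)

# One new sample as a single-row table with the same column names.
new_sample = pd.DataFrame([[34, 1, 2]], columns=["Age", "Income", "Student"])
pred = clf.predict(new_sample)
print(pred)
```

Passing the new sample as a DataFrame with matching column names keeps the feature order explicit; a plain nested list such as [[34, 1, 2]] should also work, though newer scikit-learn versions may warn about missing feature names.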

Also, by doing

clf = clf.fit(X, y)

you're training your decision tree on all the data available. If this is a training dataset and you have a test dataset stored elsewhere, that's fine; if not, you need to split the data first, so that you can both train the model and evaluate how well it performs on data it has not seen. train_test_split from sklearn.model_selection is a useful function for this.
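A rough sketch of that split, again on made-up stand-in data with the question's column names. The test_size and random_state values are arbitrary choices for illustration, not recommendations.

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the contents of test.csv; all values are invented.
df = pd.DataFrame({
    "Age":     [25, 34, 45, 52, 23, 40, 31, 60],
    "Income":  [1, 2, 3, 1, 3, 2, 1, 3],   # 1 = high, 2 = medium, 3 = low
    "Student": [1, 2, 2, 1, 1, 2, 1, 2],   # 1 = yes, 2 = no
    "Buy":     [1, 0, 0, 1, 1, 0, 1, 0],
})

X = df[["Age", "Income", "Student"]]
y = df["Buy"]

# Hold out 25% of the rows for evaluation; fix the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train only on the training rows.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out rows.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

With a dataset this small the score is very noisy; on real data, the same pattern gives an estimate of how the tree behaves on unseen rows.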

Upvotes: 1
