Reputation: 43
I am very new to Python 2.7. I am trying to run a Decision Tree Classifier on my dataset by following a tutorial, but I run into the problem below. I first vectorized my feature columns and saved the result in an array, then encoded the target column into another array using LabelEncoder. Could you please explain how I can fix this error?
Code:
import pandas as pd
dataset = "C:/Users/ashik swaroop/Desktop/anaconda/Gene Dataset/Final.csv"
datacan = pd.read_csv(dataset)
datacan = datacan.fillna('')
features = datacan[[
"Tumour_Types_Somatic","Tumour_Types_Germline",
"Cancer_Syndrome","Tissue_Type",
"Role_in_Cancer","Mutation_Types","Translocation_Partner",
"Other_Syndrome","Tier","Somatic","Germline",
"Molecular_Genetics","Other_Germline_Mut"]]
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
X_dict = features.to_dict().values()
vect = DictVectorizer(sparse=False)
X_vector = vect.fit_transform(X_dict)
le = LabelEncoder()
y_train = le.fit_transform(datacan['Gene_Symbol'][:-1])
X_Train = X_vector[:-1]
X_Test = X_vector[-1:]
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_Train,y_train)
I am getting this error:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_Train,y_train)
Traceback (most recent call last):
File "<ipython-input-49-fef4fc045a54>", line 4, in <module>
clf = clf.fit(X_Train,y_train)
File "C:\Users\ashik swaroop\Anaconda2\lib\site-packages\sklearn\tree\tree.py", line 739, in fit
X_idx_sorted=X_idx_sorted)
File "C:\Users\ashik swaroop\Anaconda2\lib\site-packages\sklearn\tree\tree.py", line 240, in fit
"number of samples=%d" % (len(y), n_samples))
ValueError: Number of labels=21638 does not match number of samples=12
Upvotes: 2
Views: 6850
Reputation: 2723
First, to understand the error:
It seems that your number of training samples (i.e. np.shape(X_Train)[0]) does not match the number of labels (i.e. np.shape(y_train)[0]).
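As a quick way to confirm this kind of mismatch before calling fit, the two leading dimensions can be compared directly. The sketch below uses placeholder arrays (not your real data) whose sizes are taken from the error message; the column count of 3 is an arbitrary assumption.

```python
import numpy as np

# Placeholder arrays, sized to mimic the numbers in the error message.
# The number of columns (3) is arbitrary and only for illustration.
X_Train = np.zeros((12, 3))
y_train = np.zeros(21638)

n_samples = np.shape(X_Train)[0]  # rows in the feature matrix
n_labels = np.shape(y_train)[0]   # entries in the label array
shapes_match = (n_samples == n_labels)

print("samples={0}, labels={1}, match={2}".format(n_samples, n_labels, shapes_match))
```

If the two numbers differ, fit will raise the ValueError shown in the question.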
Looking at your code, I notice some inconsistencies. Please refer to the inline comments below.
import pandas as pd
from apyori import apriori
dataset = "C:/Users/ashik swaroop/Desktop/anaconda/Gene Dataset/Final.csv"
datacan = pd.read_csv(dataset)
datacan = datacan.fillna('')
features = datacan[[
"Tumour_Types_Somatic","Tumour_Types_Germline",
"Cancer_Syndrome","Tissue_Type",
"Role_in_Cancer","Mutation_Types","Translocation_Partner",
"Other_Syndrome","Tier","Somatic","Germline",
"Molecular_Genetics","Other_Germline_Mut"]]
# EDIT: replace the assignment above with a plain list of column names,
# so that datacan[features] below selects those columns:
# features = [
#     "Tumour_Types_Somatic","Tumour_Types_Germline",
#     "Cancer_Syndrome","Tissue_Type",
#     "Role_in_Cancer","Mutation_Types","Translocation_Partner",
#     "Other_Syndrome","Tier","Somatic","Germline",
#     "Molecular_Genetics","Other_Germline_Mut"]
orders = datacan[features].to_dict( orient = 'records' ) # this variable is not used
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
X_dict = features.to_dict().values() # try replacing this line with X_dict = orders
vect = DictVectorizer(sparse=False)
X_vector = vect.fit_transform(X_dict)
le = LabelEncoder()
y_train = le.fit_transform(datacan['Gene_Symbol'][:-1])
X_Train = X_vector[:-1]
X_Test = X_vector[-1:]
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_Train,y_train)
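To illustrate why the orientation of to_dict matters: with its default orientation, DataFrame.to_dict() returns one inner dict per column, so .values() yields as many dicts as there are feature columns. With 13 feature columns and the last entry sliced off, that would be consistent with the 12 samples in the error. With orient='records' you get one dict per row instead. Here is a minimal sketch on a made-up toy frame; the data values and the names toy and feature_names are assumptions for illustration, not your dataset.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn import tree

# Hypothetical toy data standing in for Final.csv (4 rows, 2 feature columns)
toy = pd.DataFrame({
    "Gene_Symbol": ["A", "B", "A", "B"],
    "Tissue_Type": ["E", "L", "E", "M"],
    "Role_in_Cancer": ["oncogene", "TSG", "oncogene", "fusion"],
})
feature_names = ["Tissue_Type", "Role_in_Cancer"]

# Default orientation: one dict per COLUMN (2 here)
by_column = list(toy[feature_names].to_dict().values())
# orient='records': one dict per ROW (4 here), which is what DictVectorizer expects
by_row = toy[feature_names].to_dict(orient="records")

vect = DictVectorizer(sparse=False)
X_vector = vect.fit_transform(by_row)  # one row of X_vector per row of the frame

le = LabelEncoder()
y_train = le.fit_transform(toy["Gene_Symbol"][:-1])
X_Train = X_vector[:-1]
X_Test = X_vector[-1:]

# Sample and label counts now agree, so fit no longer raises the ValueError
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(X_Train, y_train)
prediction = le.inverse_transform(clf.predict(X_Test))
```

Holding out only the last row as the test set mirrors the code in the question; it is kept here just to stay close to the original.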
Upvotes: 1