Reputation: 31
I am trying to build a decision tree using sci-kit. But I have been getting the same value as a prediction for all the value.
le = preprocessing.LabelEncoder()
def labelEncoder(df, col_name):
df[[col_name]] = le.fit_transform(df[[col_name]])
labelEncoder(dfr, "Gender")
labelEncoder(dfr, "Subscription Tenure Type")
labelEncoder(dfr, "Located Region")
labelEncoder(dfr, "Attrition")
labelEncoder(dfr, "Type of subscription")
labelEncoder(dfr, "Genre")
# # Splitiing the data to test and train
feature = dfr[["Gender", "Age", "Subscription year", "Subscription Tenure Type", "Type of subscription",
"Located Region", "Average Hours of watching(Weekly)", "Attrition",
"Web channle utilization", "Mobile Channel Utilization"]]
labels = dfr[["Genre"]]
clf_gini = DecisionTreeClassifier(criterion="entropy", random_state=100,
max_depth=3, min_samples_leaf=9 ,min_samples_split=2, splitter='random')
clf_gini.fit(feature_train, labels_train)
y_pred = clf_gini.predict(feature_test)
print(list((y_pred)))
Following is the sample data.
User Id Genre Rating Gender Age Subscription year Subscription Tenure Type Type of subscription Located Region Average Hours of watching(Weekly) Attrition Web channle utilization Mobile Channel Utilization
1 Romance 4 Female 51 2000 Annual Individual R3 7 Yes 89 11
2 Action 4.769230769 Female 42 2004 6 Months Individual R6 13 No 88 12
2 Adventure 4.909090909 Female 42 2004 6 Months Individual R6 13 No 88 12
2 Comedy 4.2 Female 42 2004 6 Months Individual R6 13 No 88 12
2 Crime 5 Female 42 2004 6 Months Individual R6 13 No 88 12
2 Drama 4.2 Female 42 2004 6 Months Individual R6 13 No 88 12
Upvotes: 0
Views: 492
Reputation: 363
You were calling svm
instead of clf_gini
. If this does not answer your question, could you please provide some more details ?
The following example code works:
import pandas as pd
arr = [[1 , 'Romance', 4, 'Female', 51, 2000, 'Annual' , 'Individual' , 'R3', 7, 'Yes', 89, 11],
[2 , 'Action' , 4.7, 'Female', 42, 2004, '6 Months' , 'Individual', 'R6', 13, 'No', 88, 12],
[2 , 'Adventure', 4.9, 'Female', 42, 2004, '6 Months', 'Individual', 'R6', 13, 'No', 88, 12],
[2 , 'Comedy' , 4.2, 'Female', 42 , 2004, '6 Months' , 'Individual', 'R6' ,13, 'No', 88, 12],
[2 , 'Crime' , 5 , 'Female', 42 , 2004, '6 Months' , 'Individual', 'R6' , 13, 'No', 88, 12],
[2 , 'Drama' , 4.2, 'Female', 42, 2004, '6 Months' , 'Individual', 'R6', 13, 'No', 88, 12]]
headers = ['User Id', 'Genre', 'Rating', 'Gender', 'Age', 'Subscription year', 'Subscription Tenure Type', 'Type of subscription', 'Located Region', 'Average Hours of watching(Weekly)', 'Attrition', 'Web channle utilization', 'Mobile Channel Utilization']
dfr = pd.DataFrame(arr, columns = headers )
import sklearn
le = sklearn.preprocessing.LabelEncoder()
def labelEncoder(df, col_name):
df[[col_name]] = le.fit_transform(df[[col_name]])
labelEncoder(dfr, "Gender")
labelEncoder(dfr, "Subscription Tenure Type")
labelEncoder(dfr, "Located Region")
labelEncoder(dfr, "Attrition")
labelEncoder(dfr, "Type of subscription")
labelEncoder(dfr, "Genre")
# # Splitiing the data to test and train
feature = dfr[["Gender", "Age", "Subscription year", "Subscription Tenure Type", "Type of subscription",
"Located Region", "Average Hours of watching(Weekly)", "Attrition",
"Web channle utilization", "Mobile Channel Utilization"]]
clf_gini = DecisionTreeClassifier(criterion="entropy", random_state=100,
max_depth=3, min_samples_leaf=9 ,min_samples_split=2, splitter='random')
# create test / train split
dfr_train = dfr.iloc[:-1]
dfr_test = dfr.iloc[-1]
y_train = dfr_train['Genre']
y_test = dfr_test['Genre']
del dfr_train['Genre']
del dfr_test['Genre']
clf_gini.fit(dfr_train, y_train)
y_pred = clf_gini.predict(dfr_test)
print(list((y_pred)))
Upvotes: -1
Reputation: 855
There are a few issues with the code snippet you provided.
svm
instead of clf_gini
;Upvotes: 3