Reputation: 1194
I've trained a model with the Decision Tree algorithm and am trying to use it to make predictions on a DataFrame.
The model scores 0.96 on the test set. I then tried to use it to predict on the DataFrame of people who stayed, but got an error. The goal is to predict which of the people still at the company will leave in the future.
How can I achieve that?
Here is what I did:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bhaskoro-muthohar/DataScienceLearning/master/HR_comma_sep.csv')
leftdf = df[df['left']==1]
notleftdf =df[df['left']==0]
df.salary = df.salary.map({'low':0,'medium':1,'high':2})
df.salary
X = df.drop(['left','sales'],axis=1)
y = df['left']
import numpy as np
from sklearn.model_selection import train_test_split
#splitting the train and test sets
X_train, X_test, y_train, y_test= train_test_split(X,y,random_state=0, stratify=y)
from sklearn import tree
clftree = tree.DecisionTreeClassifier(max_depth=3)
clftree.fit(X_train,y_train)
y_pred = clftree.predict(X_test)
print("Test set prediction:\n {}".format(y_pred))
print("Test set score: {:.2f}".format(clftree.score(X_test, y_test)))
The result is
Test set score: 0.96
X_new = notleftdf.drop(['left','sales'],axis=1)
#Map salary to 0,1,2
X_new.salary = X_new.salary.map({'low':0,'medium':1,'high':2})
X_new.salary
prediction_will_left = clftree.predict(X_new)
print("Prediction: {}".format(prediction_will_left))
print("Predicted target name: {}".format(
notleftdf['left'][prediction_will_left]
))
The error I got is:
KeyError: "None of [Int64Index([0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n ...\n 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],\n dtype='int64', length=11428)] are in the [index]"
How can I solve it?
PS: The full script is linked here
Upvotes: 3
Views: 585
Reputation: 169174
Maybe you're looking for something like this. (Self-contained script once you download the data file to the same directory.)
from sklearn import tree
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
def process_df_for_ml(df):
    """
    Process a dataframe for model training/prediction use.
    Returns X (feature DataFrame) and y (target Series).
    """
    df = df.copy()
    # Map salary to 0,1,2
    df.salary = df.salary.map({"low": 0, "medium": 1, "high": 2})
    # X: every column except "left" and "sales"; y: the "left" column
    X = df.drop(["left", "sales"], axis=1)
    y = df["left"]
    return (X, y)
# Read the CSV.
df = pd.read_csv("HR_comma_sep.csv")
# Train a decision tree.
X, y = process_df_for_ml(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
clftree = tree.DecisionTreeClassifier(max_depth=3)
clftree.fit(X_train, y_train)
# Test the decision tree on people who haven't left yet.
notleftdf = df[df["left"] == 0].copy()
X, y = process_df_for_ml(notleftdf)
# Plug in a new column with ones and zeroes from the prediction.
notleftdf["will_leave"] = clftree.predict(X)
# Print those with the will-leave flag on.
print(notleftdf[notleftdf["will_leave"] == 1])
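As for why the original line failed, this is my reading of the error message rather than something I checked against your exact pandas version: `notleftdf['left'][prediction_will_left]` hands the array of predicted 0s and 1s to the Series as index labels, and the KeyError says none of those labels are in the filtered frame's index. Comparing the predictions to 1 gives a boolean mask instead, which selects rows by position. A minimal sketch with made-up data (the `notleft` frame and the `pred` array are stand-ins I invented, not your real data or model output):

```python
import numpy as np
import pandas as pd

# Toy stand-in for notleftdf. The index labels (5..8) deliberately do not
# include 0 or 1, as can happen after filtering rows out of a larger frame.
notleft = pd.DataFrame(
    {"satisfaction_level": [0.9, 0.2, 0.7, 0.1], "left": [0, 0, 0, 0]},
    index=[5, 6, 7, 8],
)

# Stand-in for clftree.predict(X_new): one 0/1 label per row.
pred = np.array([0, 1, 0, 1])

# Indexing a Series with an integer array looks the values up as index
# labels, so this asks for labels 0 and 1, which do not exist here.
try:
    notleft["left"][pred]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# A boolean mask keeps exactly the rows whose prediction is 1.
flagged = notleft[pred == 1]
print(flagged)
```

Storing the predictions in a column, as in the script above, amounts to the same thing and keeps the flag next to each row.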
Upvotes: 2