olive

Reputation: 189

How can you split a node based on a categorical variable in Scikit Learn Decision Tree?

I am trying to make a decision tree for the following dataset: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice

This dataset contains some categorical variables (for example, Husband's occupation: 1, 2, 3, 4). When I create my decision tree, the categorical values are split on a numeric 'less than or greater than' threshold. In other words, there is a node in my tree that splits the data as follows: "Occupation Husband <= 2.5". How can I adjust this code so that it takes categorical variables into account? When I print 'data.info()', the datatypes are correct.

import pandas as pd
import numpy as np
import os

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split

from matplotlib import pyplot as plt
import seaborn as sns

import graphviz
import pydotplus
import io
from scipy import misc

os.chdir("path") #path containing datacontra.csv file

# np.object was removed in NumPy 1.24; the builtin object is the equivalent dtype
data = pd.read_csv("datacontra.csv", dtype={'Age': np.float64, 'EduW': object, 'EduH': object, 'Child': np.int64, 'ReliW': object, 'WorkW': object, 'OccuH': object, 'SOLI': object, 'MediaExp': object, 'T': object})

data.describe()
data.head()
data.tail()

data.info()

train, test = train_test_split(data,test_size = 0.05)
print("Training size " + str(len(train)))
print("Test size " + str(len(test)))
train.shape

features = list(data.columns[:9])
label = data.columns[9]  # a single column name, so y is a 1-D Series
print(features)
print(label)

X_train = train[features]
print(X_train.shape)
y_train = train[label]
print(y_train.shape)

X_test= test[features]
y_test = test[label]

c = DecisionTreeClassifier()

dt = c.fit(X_train,y_train)

path = ("/Users/sabinekuypers/Documents/Charlotte 461/")
def show_tree(tree, features, path):
    f = io.StringIO()
    export_graphviz(tree, out_file=f, feature_names = features)
    pydotplus.graph_from_dot_data(f.getvalue()).write_png(path)
    # scipy.misc.imread was removed from SciPy; matplotlib can read the PNG
    img = plt.imread(path)
    plt.rcParams["figure.figsize"]=(20,20)
    plt.imshow(img)

show_tree(dt, features,'dt_tree.png')

y_pred = c.predict(X_test)
y_pred

from sklearn.metrics import accuracy_score

score = accuracy_score(y_test, y_pred)*100
print("Accuracy: ",round(score,1),"%")

Thank you in advance

Upvotes: 2

Views: 2728

Answers (1)

Grr

Reputation: 16079

While decision trees are capable of handling categorical values in principle, sklearn's DecisionTreeClassifier treats every feature as numeric, so you must binary encode them. For example, your feature Husband's Occupation [1,2,3,4] should become three features, each a 0/1 indicator for one occupation value (the fourth value is implied when all three are 0). You can do this in pandas with pd.get_dummies like so:

occ_dummies = pd.get_dummies(data["OccuH"], prefix="OccuH", drop_first=True)
data = pd.concat([data.drop("OccuH", axis=1), occ_dummies], axis=1)

From there you can continue to use your data as you had previously.
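One caveat when continuing with the question's code: pd.concat appends the new indicator columns at the end, so selecting features and label by position (data.columns[:9], data.columns[9]) no longer picks the right columns. A minimal sketch of the whole step, selecting by name instead. The toy rows below are made up to stand in for datacontra.csv, and only three of the columns are shown:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy rows standing in for datacontra.csv (not real data).
data = pd.DataFrame({
    "Age":   [24.0, 45.0, 43.0, 42.0, 36.0, 19.0],
    "OccuH": ["2", "3", "3", "1", "4", "2"],
    "T":     ["1", "1", "2", "3", "2", "3"],
})

# Replace OccuH with 0/1 indicator columns; the prefix keeps names readable.
occ_dummies = pd.get_dummies(data["OccuH"], prefix="OccuH",
                             drop_first=True, dtype=int)
data = pd.concat([data.drop("OccuH", axis=1), occ_dummies], axis=1)

# Select by name: after concat the label is no longer the last column.
label = "T"
features = [col for col in data.columns if col != label]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(data[features], data[label])
print(features)
```

Any split the tree makes on an indicator column now separates one occupation from the rest (for example "OccuH_3 <= 0.5") rather than cutting through the numeric codes.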

One note on the drop_first kwarg: it drops the indicator column for the first category so that the remaining columns are not linearly dependent (the dropped category is implied when all the others are 0), as explained in "One-hot vs dummy encoding in Scikit-learn".
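To see what drop_first changes, here is a small sketch comparing both encodings on a made-up OccuH column holding one row per category (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical column with one row per occupation category.
occu = pd.Series(["1", "2", "3", "4"], name="OccuH")

full = pd.get_dummies(occu, prefix="OccuH", dtype=int)
reduced = pd.get_dummies(occu, prefix="OccuH", drop_first=True, dtype=int)

print(list(full.columns))        # one indicator per category
print(list(reduced.columns))     # the first category's column is dropped
print(full.sum(axis=1).tolist())  # full indicators always sum to 1 per row
print(reduced.iloc[0].tolist())   # category "1" becomes all zeros
```

With the full encoding, the indicators in each row always sum to 1, which is the linear dependency being avoided. As far as I know this matters mainly for linear models; a decision tree can be fit on either encoding.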

Upvotes: 3
