Sean

Reputation: 3450

Sklearn DecisionTreeClassifier F-Score: Different Results on Each Run

I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data and f1_score as my evaluation metric. The strange thing is that my model gives different results on each run, and the variation follows a pattern.

data in my code is a (2000, 7) pandas.DataFrame, with 6 feature columns and the last column holding the target value. Columns 1, 3, and 5 contain categorical data.
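For context, this is roughly how the data looks when loaded (a minimal sketch; the actual column names differ):

import pandas as pd

data = pd.read_csv("./data/train.csv")
print(data.shape)   # (2000, 7)
print(data.dtypes)  # columns 1, 3, and 5 show dtype 'object' (categorical)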

The following code is what I did to preprocess and format my data:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score


# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")

X = data.iloc[:, :-1]
y = data.iloc[:, 6]

# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(X.select_dtypes(include='object').columns)

for col in categorical_data:
    # fit_transform refits the encoder on each column's own categories
    X[col] = labelenc.fit_transform(X[col])


# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
# fit the scaler on the training data only, then reuse it for validation
X_val_scaled = min_max_scaler.transform(X_val)


The next code is for the actual decision tree model training:

dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')

print("Score is = {}".format(score))


The output that I get (i.e. the score) varies, but within a pattern: for example, it cycles within the range of roughly 0.39 to 0.42.
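To make the pattern concrete, here is a minimal sketch of how I re-run the whole split/train/score sequence (reusing X and y from above):

scores = []
for _ in range(10):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2)
    scaler = MinMaxScaler()
    clf = DecisionTreeClassifier(class_weight='balanced')
    clf.fit(scaler.fit_transform(X_tr), y_tr)
    preds = clf.predict(scaler.transform(X_va))
    scores.append(f1_score(y_va, preds, average='macro'))

print(min(scores), max(scores))  # roughly 0.39 and 0.42 for me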

On some iterations I even get an UndefinedMetricWarning, which says "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
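When the warning fires, a quick check shows that some classes present in y_val never appear in the predictions (using the variables from the run above):

missing = np.setdiff1d(np.unique(y_val), np.unique(predictions))
print("classes with no predicted samples:", missing)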

I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. I think my questions can be organized into two:


  1. Why does my output vary on each run? Is there something happening in the preprocessing stage that I'm not aware of?

  2. I've also tried computing the F-score with other data splits, but I always get the warning. Is this unavoidable?

Thank you.

Upvotes: 1

Views: 1734

Answers (1)

Naveen

Reputation: 1210

You are splitting the dataset with train_test_split, which divides the rows randomly. Because of this, every run trains the model on different training data and evaluates it on different test data, so you get a range of F-scores depending on how well the model happens to fit each particular split.
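You can verify this directly: without random_state, two consecutive calls to train_test_split return different splits of the same data (a quick check, reusing X and y from the question):

X_a, _, y_a, _ = train_test_split(X, y, test_size=0.2)
X_b, _, y_b, _ = train_test_split(X, y, test_size=0.2)
print((X_a == X_b).all())  # almost certainly False: different rows each call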

To replicate the result each time you run, use the random_state parameter. It seeds the random number generator, so the random numbers are generated in the same order on every run and you get the same split. The value itself can be any integer. Note that DecisionTreeClassifier also uses randomness internally (features are permuted when searching for the best split), so give it a random_state as well:

#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)

#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
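As a quick sanity check (reusing the names from the question), running the whole pipeline several times with both seeds fixed should now print a single score:

scores = set()
for _ in range(5):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=13)
    scaler = MinMaxScaler()
    dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
    dectree.fit(scaler.fit_transform(X_train), y_train)
    preds = dectree.predict(scaler.transform(X_val))
    scores.add(f1_score(y_val, preds, average='macro'))

print(scores)  # one unique value: every run is identical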

Upvotes: 3
