user1917470

Reputation: 13

Decision boundary changing in sklearn each time I run code

In Udacity's Intro to Machine Learning class, I am finding that the result of my code can change each time I run it. The correct values are acc_min_samples_split_2 = .908 and acc_min_samples_split_50 = .912, but when I run my script, acc_min_samples_split_2 sometimes comes out as .912 as well. This happens both on my local machine and in Udacity's web interface. Why might this be happening?

The program uses the scikit-learn library for Python. Here is the part of the code that I wrote:

def classify(features, labels, samples):
    # Create a new decision tree classifier and fit it to the training data,
    # using the specified min_samples_split value.
    from sklearn import tree
    clf = tree.DecisionTreeClassifier(min_samples_split = samples)
    clf = clf.fit(features, labels)
    return clf

#Create a classifier with a min sample split of 2, and test its accuracy
clf2 = classify(features_train, labels_train, 2)
acc_min_samples_split_2 = clf2.score(features_test,labels_test)

#Create a classifier with a min sample split of 50, and test its accuracy
clf50 = classify(features_train, labels_train, 50)
acc_min_samples_split_50 = clf50.score(features_test,labels_test)

def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}

# Parenthesized print works in both Python 2 and Python 3
print(submitAccuracies())

Upvotes: 1

Views: 1275

Answers (1)

sascha

Reputation: 33542

Some classifiers in scikit-learn are stochastic: they use a pseudorandom number generator (PRNG) internally.

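DecisionTreeClassifier is one of them. According to the docs, the features are randomly permuted at each split, so when several candidate splits improve the criterion equally, the one that gets picked can vary between runs. Check the docs and use the random_state argument to make that random behaviour deterministic.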

Just create your classifier like this:

clf = tree.DecisionTreeClassifier(min_samples_split = samples, random_state=0)  # or any other constant

If you don't provide a random_state (a fixed integer seed, as in my example above), the PRNG is seeded from some external source (most probably OS entropy or the system time), so results can differ across runs of the script.
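A quick way to see the effect of seeding, sketched here with Python's standard random module rather than scikit-learn's internal generator (the helper name draw is made up for illustration):

```python
import random

def draw(seed=None, n=5):
    # Hypothetical helper: build a fresh PRNG and draw n integers from it.
    # seed=None lets Python seed the generator from an external source.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

# Same fixed seed: the two sequences are identical.
print(draw(seed=0) == draw(seed=0))

# No seed: the two sequences will almost certainly differ.
print(draw() == draw())
```

The same idea applies to random_state: a fixed integer pins down the sequence of random choices, while leaving it unset lets them vary from run to run.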

Two runs that share the same code and the same constant will behave identically (ignoring some pathological architecture/platform differences).
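To check this yourself, here is a minimal sketch. It assumes a synthetic dataset from make_classification as a stand-in for the course data (which isn't shown in the question), and fit_and_score is just an illustrative helper name:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Udacity dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

def fit_and_score(min_samples_split, seed):
    # Hypothetical helper: fit a tree with a fixed random_state and
    # return its accuracy on the held-out test set.
    clf = DecisionTreeClassifier(min_samples_split=min_samples_split,
                                 random_state=seed)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Refit three times with the same seed; the accuracy should not change.
scores = [fit_and_score(2, seed=0) for _ in range(3)]
print(scores)
```

If you pass seed=None instead, the scores may or may not differ between fits, depending on whether the data contains equally good splits.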

Upvotes: 2
