Reputation: 109

Different results on the same dataset in machine learning

I use the scikit-learn library for machine learning on text data. My code looks like this:

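    import nltk
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer

    # train, test, y_train and stop_words are defined elsewhere
    vectorizer = TfidfVectorizer(analyzer='word', tokenizer=nltk.word_tokenize, stop_words=stop_words).fit(train)
    matr_train = vectorizer.transform(train)
    X_train = matr_train.toarray()
    matr_test = vectorizer.transform(test)
    X_test = matr_test.toarray()
    rfc = RandomForestClassifier()
    rfc.fit(X_train, y_train)
    y_predict = rfc.predict(X_test)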

On the first run, the test-set recall is 0.17 and the precision is 1.00. But when I run it a second time on the same training and test datasets, the result is different: recall 0.23 and precision 1.00. Every further run gives yet another result. Meanwhile, the precision and recall on the training dataset stay the same from run to run.

Why does this happen? Could it be related to something in my data?

Thanks.

Upvotes: 4

Answers (2)

Irshad Bhat

Reputation: 8709

A random forest fits a number of decision tree classifiers on various sub-samples of the dataset. Every time you fit the classifier, those sub-samples are drawn at random, so the results differ from run to run. To control this, set the random_state parameter:

    rfc = RandomForestClassifier(random_state=137)

Note that random_state is the seed used by the random number generator. You can use any integer for this parameter. Changing the random_state value will likely change the results, but as long as you keep the same value (with the same data and other parameters), you will get the same results.
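To make this concrete, here is a minimal sketch that fits two forests with the same seed and compares their outputs. The synthetic data from make_classification merely stands in for the TF-IDF matrix in the question, and the helper name fit_and_predict and the sizes used are my own assumptions, not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the TF-IDF features from the question (an assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def fit_and_predict(seed):
    """Fit a fresh forest with the given seed and return test-set probabilities."""
    rfc = RandomForestClassifier(n_estimators=50, random_state=seed)
    rfc.fit(X_train, y_train)
    return rfc.predict_proba(X_test)

first = fit_and_predict(137)
second = fit_and_predict(137)

# With an identical seed, data and parameters, both fits should agree exactly.
same_seed_identical = np.array_equal(first, second)
print(same_seed_identical)
```

If you call fit_and_predict with two different seeds instead, the probabilities will usually differ somewhat, which is the run-to-run variation described in the question.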

The random_state parameter is used in various other classifiers as well. For example, in neural networks we use random_state to fix the initial weight vectors for every run of the classifier. This helps when tuning other hyper-parameters such as the learning rate or weight decay. If we don't set random_state, we cannot be sure whether a change in performance is due to the change in hyper-parameters or to the change in initial weights. Once the hyper-parameters are tuned, we can change random_state to further improve the performance of the model.
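As a rough illustration of the tuning workflow described above, the sketch below trains scikit-learn's MLPClassifier with one fixed seed while varying only the learning rate. The synthetic data, the network size, the candidate learning rates and the helper name fit_mlp are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data; the sizes and learning rates below are illustrative assumptions.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def fit_mlp(learning_rate, seed):
    """Train a small network; `seed` fixes the initial weights and batch shuffling."""
    mlp = MLPClassifier(
        hidden_layer_sizes=(16,),
        learning_rate_init=learning_rate,
        max_iter=300,
        random_state=seed,
    )
    return mlp.fit(X_train, y_train)

# Same seed for every candidate, so only the learning rate changes between runs.
scores = {lr: fit_mlp(lr, seed=137).score(X_test, y_test) for lr in (0.001, 0.01, 0.1)}
print(scores)

# Sanity check: repeating one configuration with the same seed reproduces the weights.
a = fit_mlp(0.01, seed=137)
b = fit_mlp(0.01, seed=137)
weights_match = all(np.allclose(wa, wb) for wa, wb in zip(a.coefs_, b.coefs_))
print(weights_match)
```

Some configurations may emit a convergence warning with this iteration budget; that does not affect the point being shown.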

Upvotes: 6

Chris

Reputation: 967

The clue is (at least partly) in the name.

A Random Forest uses randomised decision trees, and as such, each time you fit, the result will change.

https://www.quora.com/How-does-randomization-in-a-random-forest-work
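One source of that randomness is the bootstrap sample each tree is trained on. The sketch below uses only the standard library to show the idea; it is a simplified illustration, not scikit-learn's actual implementation, and the function names are hypothetical. By default scikit-learn's random forest also randomises which features are considered at each split, which this sketch leaves out.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement, as one tree's sub-sample."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def forest_samples(n_trees, n_rows, seed=None):
    """Return one bootstrap sample per tree; seed=None gives an unpredictable seed."""
    rng = random.Random(seed)
    return [bootstrap_indices(n_rows, rng) for _ in range(n_trees)]

seeded_a = forest_samples(n_trees=3, n_rows=10, seed=137)
seeded_b = forest_samples(n_trees=3, n_rows=10, seed=137)
unseeded = forest_samples(n_trees=3, n_rows=10)

# The same seed yields the same sub-samples for every tree.
print(seeded_a == seeded_b)
# Without a seed, the sub-samples will almost certainly differ from run to run.
print(seeded_a[0])
print(unseeded[0])
```

Because every tree sees a different sub-sample, an unseeded forest is a slightly different model on each fit, and its test-set predictions shift accordingly.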

Upvotes: 1
