I am testing several ML classification models, in this case Support Vector Machines. I have basic knowledge about the SVM algorithm and how it works.
I am using the built-in breast cancer dataset from scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
Using the code below:
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
stratify=cancer.target, random_state=42)
clf2 = LinearSVC(C=0.01).fit(X_train, y_train)
clf3 = LinearSVC(C=0.1).fit(X_train, y_train)
clf4 = LinearSVC(C=1).fit(X_train, y_train)
clf5 = LinearSVC(C=10).fit(X_train, y_train)
clf6 = LinearSVC(C=100).fit(X_train, y_train)
When printing the scores as in:
print("Model training score with C=0.01:\n{:.3f}".format(clf2.score(X_train, y_train)))
print("Model testing score with C=0.01:\n{:.3f}".format(clf2.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=0.1:\n{:.3f}".format(clf3.score(X_train, y_train)))
print("Model testing score with C=0.1:\n{:.3f}".format(clf3.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=1:\n{:.3f}".format(clf4.score(X_train, y_train)))
print("Model testing score with C=1:\n{:.3f}".format(clf4.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=10:\n{:.3f}".format(clf5.score(X_train, y_train)))
print("Model testing score with C=10:\n{:.3f}".format(clf5.score(X_test, y_test)))
print("------------------------------")
print("Model training score with C=100:\n{:.3f}".format(clf6.score(X_train, y_train)))
print("Model testing score with C=100:\n{:.3f}".format(clf6.score(X_test, y_test)))
When I run this code, I get a certain score for each value of the regularization parameter C. When I run the .fit lines again (i.e. retrain the models), these scores turn out completely different, sometimes drastically so (e.g. 72% vs. 90% for the same value of C).
Where does this variability come from? I thought that, since I use the same random_state parameter, it would always find the same support vectors and hence give me the same results, but the score changes every time I retrain the model, so this is not the case. With logistic regression, for instance, the scores are always consistent, no matter how often I rerun the .fit code.
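To show what I mean, here is the kind of check where I see it (just refitting the same configuration a few times on the same split; the printed numbers change between runs):
# Refit the exact same configuration several times on the same split;
# without a fixed seed the test scores differ between runs
for i in range(3):
    clf = LinearSVC(C=1).fit(X_train, y_train)
    print("Run {}: test score = {:.3f}".format(i + 1, clf.score(X_test, y_test)))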
An explanation of this variability in the accuracy score would be of much help!
Upvotes: 1
Views: 459
Of course. You need to fix LinearSVC's random_state (which defaults to None) to a specific seed so that you can reproduce the results. Otherwise, with the default random_state=None, a different random seed is used every time you call .fit, and that is why you get this variability. Note that the random_state you pass to train_test_split only fixes the split, not the solver.
Use:
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target,
stratify=cancer.target, random_state=42)
clf2 = LinearSVC(C=0.01, random_state=42).fit(X_train, y_train)
clf3 = LinearSVC(C=0.1, random_state=42).fit(X_train, y_train)
clf4 = LinearSVC(C=1, random_state=42).fit(X_train, y_train)
clf5 = LinearSVC(C=10, random_state=42).fit(X_train, y_train)
clf6 = LinearSVC(C=100, random_state=42).fit(X_train, y_train)
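As a quick sanity check (a minimal sketch reusing the split above; the clf_a / clf_b names are just for illustration), two fits of the same configuration should now give identical scores:
clf_a = LinearSVC(C=1, random_state=42).fit(X_train, y_train)
clf_b = LinearSVC(C=1, random_state=42).fit(X_train, y_train)
# With the seed fixed, both fits run deterministically, so this should print True
print(clf_a.score(X_test, y_test) == clf_b.score(X_test, y_test))
(If you also see ConvergenceWarnings, that is because the features of this dataset are on very different scales; scaling them or raising max_iter helps with that, but it is a separate issue from reproducibility.)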
Upvotes: 1