Ivan

Reputation: 703

Difference between sklearn's DecisionTreeClassifier and SVM?

I am new to Machine Learning - specifically, Classification techniques.

I have read a few tutorials online and I'm using the iris data set. I tried splitting the data set into train and test using

train, test = train_test_split(df,
                               test_size=test_size,
                               train_size=train_size,
                               random_state=random_state)

Subsequently, I found 2 ways to fit the model (DecisionTreeClassifier & SVM):

dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
clf = svm.SVC(kernel='linear', C=1)

Both models allow me to use the .fit() and .score() methods. I tried resampling the data with different sizes and random states, but I get the exact same score of 0.9852 with both models. Am I doing something wrong?

Also, is there a need to convert my target variable ("class") to numeric values as stated here? I have tried fitting the data frame with the original string values and I am getting the same results. Any help is greatly appreciated!

Upvotes: 1

Views: 432

Answers (1)

seralouk

Reputation: 33147

The right way to use train_test_split is something like the following:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris data. X: features, y: target labels
df = load_iris()
y = df.target
X = df.data

# Split the data into training (60%) and test (40%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=99)

#Fit the 2 classifiers
dt = DecisionTreeClassifier(min_samples_split=20, random_state=99)
clf = SVC(kernel='linear', C=1)

dt.fit(X_train, y_train)
y_predicted_dt = dt.predict(X_test)
scores_dt = accuracy_score(y_test, y_predicted_dt)
print(scores_dt)

clf.fit(X_train, y_train)
y_predicted_clf = clf.predict(X_test)
scores_clf = accuracy_score(y_test, y_predicted_clf)
print(scores_clf)

Results:

#Accuracy of dt classifier
0.933333333333

#Accuracy of clf classifier
0.983333333333

Bottom line:

In your case, you pass the entire df to train_test_split instead of passing the features X and the labels y separately.
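
If the data lives in a pandas DataFrame, as in the question, one way to separate the features from the labels before splitting might look like the sketch below. It assumes the label column is named "class" (as the question suggests) and rebuilds a DataFrame from load_iris purely for illustration; the asker's actual df may be laid out differently.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative DataFrame: four feature columns plus a string "class" column.
# The column name "class" is an assumption taken from the question's wording.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["class"] = iris.target_names[iris.target]

# Separate the features (X) from the labels (y) before splitting
X = df.drop(columns="class")
y = df["class"]

# Passing X and y together keeps the rows of both aligned in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=99
)

clf = SVC(kernel="linear", C=1)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)  # mean accuracy on the held-out test set
print(score)
```

Scoring on X_test and y_test, which the model never saw during fitting, is what makes the number a fair estimate.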

You do not need to convert the classes: scikit-learn classifiers accept string labels directly. Just use the accuracy_score or cross_val_score functions to evaluate the models.
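
As a rough illustration of the cross_val_score route, the sketch below evaluates both classifiers with 5-fold cross-validation while keeping the labels as strings. The choice of cv=5 and the variable names (y_str, models, cv_means) are assumptions made for this example, not something taken from the question.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data
# Keep the labels as strings ("setosa", "versicolor", "virginica")
y_str = iris.target_names[iris.target]

models = {
    "decision_tree": DecisionTreeClassifier(min_samples_split=20, random_state=99),
    "linear_svm": SVC(kernel="linear", C=1),
}

cv_means = {}
for name, model in models.items():
    # cross_val_score refits a clone of the model on each of the 5 folds
    scores = cross_val_score(model, X, y_str, cv=5)
    cv_means[name] = scores.mean()
    print(name, scores.mean(), scores.std())
```

Averaging over several folds is less sensitive to one particular split than a single train/test score, which can make differences between the two models easier to see.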

Upvotes: 0
