Bryan McCormack

Reputation: 105

What do the results of a scikit-learn machine learning program represent?

I am working through Google's Machine Learning videos and completed a program that uses a dataset storing information about flowers. The program runs successfully, but I'm having trouble understanding the results:

from scipy.spatial import distance

def euc(a,b):
    return distance.euclidean(a, b)

class ScrappyKNN():

    def fit(self, x_train, y_train):
        self.x_train = x_train
        self.y_train = y_train

    def predict(self, x_test):
        predictions = []
        for row in x_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        best_dist = euc(row, self.x_train[0])
        best_index = 0
        for i in range(1, len(self.x_train)):
            dist = euc(row, self.x_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]

from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
print(x_train.shape, x_test.shape)

my_classifier = ScrappyKNN()
my_classifier.fit(x_train, y_train)
prediction = my_classifier.predict(x_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))

Results are as follows: (75, 4) (75, 4) 0.96

The 96% is the accuracy, but what exactly do the 75 and 4 represent?

Upvotes: 0

Views: 101

Answers (2)

Axois

Reputation: 2061

It looks like you are coding k-nearest neighbours from scratch using the Euclidean distance metric.

With the line x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5), you are splitting the data into 50% training data and 50% test data. sklearn's train_test_split splits the data by rows, so the number of features (columns) stays the same in both halves. Hence each (75, 4) is the number of rows followed by the number of features, printed first for the training set and then for the test set.

Now, the accuracy score of 0.96 basically means that, of your 75 rows in your test set, 96% are predicted correctly.
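To make that concrete, here is a minimal sketch of how an accuracy figure like this comes about: it is the fraction of test labels that match the predicted labels. The label lists are made up for illustration (72 matches out of 75); they are not your actual predictions.

```python
# Hypothetical labels: 75 test rows, 3 of which are predicted wrongly.
y_true = [0] * 25 + [1] * 25 + [2] * 25
y_pred = list(y_true)
y_pred[30] = 2  # a class-1 row predicted as class 2
y_pred[60] = 1  # a class-2 row predicted as class 1
y_pred[70] = 1  # another class-2 row predicted as class 1

# Accuracy = number of matching labels / number of test rows.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
print(correct, len(y_true), accuracy)  # 72 75 0.96
```

As far as I know, accuracy_score(y_test, prediction) performs this same comparison with its default settings.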

The score comes from comparing the true labels of your test set (y_test) with the predicted labels (the result of prediction = my_classifier.predict(x_test)).

In confusion-matrix terms, TP (true positives) and TN (true negatives) are the correct predictions, while TP + TN + FP + FN adds up to 75 (the total number of rows you are testing), so accuracy = (TP + TN) / (TP + TN + FP + FN).
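Since iris has three classes rather than two, the TP/TN/FP/FN picture generalises to a 3x3 table: correct predictions sit on the diagonal and every cell together sums to the number of test rows. Below is a rough sketch of that idea. The helper build_confusion_matrix and the labels are hypothetical, written only for this example.

```python
def build_confusion_matrix(y_true, y_pred, n_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical labels: 75 test rows, 3 of which are predicted wrongly.
y_true = [0] * 25 + [1] * 25 + [2] * 25
y_pred = list(y_true)
y_pred[30] = 2
y_pred[60] = 1
y_pred[70] = 1

cm = build_confusion_matrix(y_true, y_pred, 3)
correct = sum(cm[i][i] for i in range(3))  # diagonal = correct predictions
total = sum(sum(row) for row in cm)        # all cells = number of test rows
for row in cm:
    print(row)
print(correct / total)  # 0.96
```

sklearn.metrics.confusion_matrix builds the same kind of table if you would rather not write it by hand.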

Note: When performing a train-test split, it's usually a good idea to split the data 80/20 instead of 50/50, so that the model has more data to train on.

Upvotes: 0

brentertainer

Reputation: 2198

You are printing the shapes of the datasets on this line:

print(x_train.shape, x_test.shape) 

Both x_train and x_test seem to have 75 rows (i.e. data points) and 4 columns (i.e. features) each. Unless you had an odd number of data points, these dimensions should be the same since you are performing a 50/50 training/testing data split on this line:

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
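A quick way to check this yourself is to run the split on placeholder arrays and print the shapes. The sketch below assumes numpy and scikit-learn are installed; the zero-filled arrays only stand in for the iris data (150 rows, 4 features), and the 151-row case is an invented example of an odd number of data points.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as iris: 150 rows, 4 features.
x = np.zeros((150, 4))
y = np.zeros(150)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5)
print(x_train.shape, x_test.shape)  # (75, 4) (75, 4)

# With an odd number of rows the two halves cannot be equal,
# so the row counts differ by one while the column count stays at 4.
x_odd = np.zeros((151, 4))
y_odd = np.zeros(151)
xo_train, xo_test, yo_train, yo_test = train_test_split(x_odd, y_odd, test_size=.5)
print(xo_train.shape, xo_test.shape)
```

The first number in each tuple is the row count (data points) and the second is the column count (features), which the split never changes.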

Upvotes: 1
