Reputation: 105
I am working through Google's Machine Learning videos and completed a program that uses a dataset storing info about flowers. The program runs successfully, but I'm having trouble understanding the results:
from scipy.spatial import distance

def euc(a, b):
    # Euclidean distance between two feature vectors
    return distance.euclidean(a, b)

class ScrappyKNN():
    def fit(self, x_train, y_train):
        # Memorize the training data
        self.x_train = x_train
        self.y_train = y_train

    def predict(self, x_test):
        predictions = []
        for row in x_test:
            label = self.closest(row)
            predictions.append(label)
        return predictions

    def closest(self, row):
        # Return the label of the single nearest training example
        best_dist = euc(row, self.x_train[0])
        best_index = 0
        for i in range(1, len(self.x_train)):
            dist = euc(row, self.x_train[i])
            if dist < best_dist:
                best_dist = dist
                best_index = i
        return self.y_train[best_index]

from sklearn import datasets
iris = datasets.load_iris()
x = iris.data
y = iris.target

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5)
print(x_train.shape, x_test.shape)

my_classifier = ScrappyKNN()
my_classifier.fit(x_train, y_train)
prediction = my_classifier.predict(x_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, prediction))
Results are as follows: (75, 4) (75, 4) 0.96
The 96% is the accuracy, but what exactly do the 75 and 4 represent?
Upvotes: 0
Views: 101
Reputation: 2061
It looks like you are implementing K Nearest Neighbours from scratch using the Euclidean distance metric.
In your code, x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5) splits the data 50/50 into a train set and a test set. sklearn's train_test_split splits the data by rows, so the number of features (columns) stays the same in both sets. Hence (75, 4) is the number of rows followed by the number of features, in the train set and the test set respectively.
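For illustration, here is a minimal, self-contained sketch of the same split that prints all four shapes (the iris data has 150 rows and 4 features):

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target                   # (150, 4) and (150,)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.5)

print(x_train.shape, x_test.shape)              # (75, 4) (75, 4) -- rows halved, columns unchanged
print(y_train.shape, y_test.shape)              # (75,) (75,)     -- labels halved the same way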
Now, the accuracy score of 0.96 means that, of the 75 rows in your test set, 96% were predicted correctly. accuracy_score compares the true labels from your test set (y_test) with the predicted labels (the list returned by prediction = my_classifier.predict(x_test)). In confusion-matrix terms, TP + TN is the number of correct predictions while TP + TN + FP + FN sums to 75 (the total number of rows you are testing), so accuracy = (TP + TN) / (TP + TN + FP + FN).
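If you want to check this by hand, here is a small sketch (it assumes the y_test and prediction variables from your code; the exact count will vary from run to run because the split is random):

import numpy as np

correct = np.sum(np.array(prediction) == y_test)   # how many test rows were predicted correctly
print(correct, len(y_test))                        # e.g. 72 75
print(correct / len(y_test))                       # e.g. 0.96 -- the same value accuracy_score returns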
Note: When performing train_test_split, it is usually a good idea to split the data 80/20 instead of 50/50, so the model has more data to learn from and generally gives better predictions.
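For example, only the test_size argument changes relative to your code (on the 150-row iris data this gives 120 training rows and 30 test rows):

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2)
print(x_train.shape, x_test.shape)   # (120, 4) (30, 4)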
Upvotes: 0
Reputation: 2198
You are printing the shapes of the datasets on this line:
print(x_train.shape, x_test.shape)
Both x_train and x_test seem to have 75 rows (i.e. data points) and 4 columns (i.e. features) each. Unless you had an odd number of data points, these dimensions should be the same since you are performing a 50/50 training/testing data split on this line:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size =.5)
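To see the "odd number of data points" caveat concretely, here is a small sketch using a hypothetical 149-row slice of the data (x and y as in your code); the two halves then differ by one row, and which half gets the extra row depends on how sklearn rounds:

from sklearn.model_selection import train_test_split

x_odd, y_odd = x[:149], y[:149]                        # hypothetical odd-sized subset, for illustration only
x_tr, x_te, y_tr, y_te = train_test_split(x_odd, y_odd, test_size=.5)
print(x_tr.shape, x_te.shape)                          # e.g. (74, 4) (75, 4) -- shapes no longer match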
Upvotes: 1