Reputation: 12181
My assignment is to use a K-Nearest Neighbor algorithm to determine what kind of flower something is based on various features of it (e.g. stem length, petal length, etc.) using NumPy. (For the record, I've worked with Python in the past, although it's not my "best" language; however, I'm completely new to NumPy).
Both my training and my testing data are in CSVs that look like this:
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
I know how to do the basic algorithm. Here's the C# I created for it:
using System;
using System.Collections.Generic;
using Microsoft.VisualBasic.FileIO;

namespace Project_3_Prototype
{
    public class FourD
    {
        public double f1, f2, f3, f4;
        public string name;

        public static double Distance(FourD a, FourD b)
        {
            double squared = Math.Pow(a.f1 - b.f1, 2) + Math.Pow(a.f2 - b.f2, 2)
                           + Math.Pow(a.f3 - b.f3, 2) + Math.Pow(a.f4 - b.f4, 2);
            return Math.Sqrt(squared);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            // Read the training data into a list of FourD records
            List<FourD> distances = new List<FourD>();
            using (var parser = new TextFieldParser("iris-training-data.csv"))
            {
                parser.SetDelimiters(",");
                while (!parser.EndOfData)
                {
                    string[] fields = parser.ReadFields();
                    var curr = new FourD
                    {
                        f1 = double.Parse(fields[0]),
                        f2 = double.Parse(fields[1]),
                        f3 = double.Parse(fields[2]),
                        f4 = double.Parse(fields[3]),
                        name = fields[4]
                    };
                    distances.Add(curr);
                }
            }

            double correct = 0, total = 0;
            using (var parser = new TextFieldParser("iris-testing-data.csv"))
            {
                parser.SetDelimiters(",");
                int i = 1;
                while (!parser.EndOfData)
                {
                    total++;
                    string[] fields = parser.ReadFields();
                    var curr = new FourD
                    {
                        f1 = double.Parse(fields[0]),
                        f2 = double.Parse(fields[1]),
                        f3 = double.Parse(fields[2]),
                        f4 = double.Parse(fields[3]),
                        name = fields[4]
                    };

                    // Find the training example closest to this testing example
                    FourD min = distances[0];
                    foreach (FourD comp in distances)
                    {
                        if (FourD.Distance(comp, curr) < FourD.Distance(min, curr))
                        {
                            min = comp;
                        }
                    }

                    if (min.name == curr.name)
                    {
                        correct++;
                    }

                    Console.WriteLine(string.Format("{0},{1},{2}", i, curr.name, min.name));
                    i++;
                }
            }

            Console.WriteLine("Accuracy: " + correct / total);
            Console.ReadLine();
        }
    }
}
This is working exactly as expected, with the following output:
# The format is Number,Correct label,Predicted Label
1,Iris-setosa,Iris-setosa
2,Iris-setosa,Iris-setosa
3,Iris-setosa,Iris-setosa
4,Iris-setosa,Iris-setosa
5,Iris-setosa,Iris-setosa
6,Iris-setosa,Iris-setosa
7,Iris-setosa,Iris-setosa
8,Iris-setosa,Iris-setosa
9,Iris-setosa,Iris-setosa
10,Iris-setosa,Iris-setosa
11,Iris-setosa,Iris-setosa
12,Iris-setosa,Iris-setosa
...
Accuracy: 0.946666666666667
I'm trying to do the same thing in NumPy. However, the assignment does not permit me to use for loops, only vectorized functions.
So, basically what I want to do is: for every row in the testing data, get the index of the row in the training data that's closest to it (i.e. has the minimum Euclidean distance).
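To make the goal concrete, here is a rough sketch of the fully vectorized shape I think I'm after (the names train and test are just placeholders for 4-column float arrays, not variables from my actual code):

import numpy as np

# train has shape (m, 4) and test has shape (n, 4).
# Broadcasting test[:, np.newaxis, :] (shape (n, 1, 4)) against train (shape (m, 4))
# gives an (n, m, 4) array of differences; squaring, summing over axis 2, and
# taking the square root yields an (n, m) matrix of pairwise Euclidean distances.
dists = np.sqrt(((test[:, np.newaxis, :] - train) ** 2).sum(axis=2))
# For each testing row, the index of the closest training row
nearest = dists.argmin(axis=1)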
Here's what I tried in Python:
import numpy as np

def main():
    # Split each line of the CSV into a list of attributes and labels
    data = [x.split(',') for x in open("iris-training-data.csv")]
    # The last item is the label
    labels = np.array([x[-1].rstrip() for x in data])
    # Convert the first 3 items to a 2D array of floats
    floats = np.array([x[0:3] for x in data]).astype(float)
    classifyTrainingExamples(labels, floats)

def classifyTrainingExamples(labels, floats):
    # We're basically doing the same thing to the testing data that we did to the training data
    testingData = [x.split(',') for x in open("iris-testing-data.csv")]
    testingLabels = np.array([x[-1].rstrip() for x in testingData])
    testingFloats = np.array([x[0:3] for x in testingData]).astype(float)

    res = np.apply_along_axis(lambda x: closest(floats, x), 1, testingFloats)
    correct = 0
    for number, index in enumerate(res):
        if labels[index] == testingLabels[number]:
            correct += 1
        print("{},{},{}".format(number + 1, testingLabels[number], labels[index]))
        number += 1
    print(correct / len(list(res)))

def closest(otherArray, item):
    res = np.apply_along_axis(lambda x: distance(x, item), 1, otherArray)
    i = np.argmin(res)
    return i

# Get the Euclidean distance between two "flat" lists (i.e. one particular row)
def distance(a, b):
    # Subtract one from the other elementwise, then raise each one to the power of 2
    lst = (a - b) ** 2
    # Sum all of the elements together, and take the square root
    result = np.sqrt(lst.sum())
    return result

main()
Unfortunately, the output looks like this:
1,Iris-setosa,Iris-setosa
2,Iris-setosa,Iris-setosa
3,Iris-setosa,Iris-setosa
4,Iris-setosa,Iris-setosa
....
74,Iris-setosa,Iris-setosa
75,Iris-setosa,Iris-setosa
0.93333333
Every single line has nothing but Iris-setosa for the labels, and the accuracy is 0.9333333. I tried stepping through this with a debugger, and every item is being counted as correct by the if statement (but the correctness percentage is still shown as 0.93333333). So basically: I get Iris-setosa for every value. Can someone help me see what I'm missing here?
And before anyone asks, for the record, yes, I did try stepping through this with a debugger :) Also for the record, yes, this is homework.
Upvotes: 0
Views: 1298
Reputation: 91
If you really want to do it in one line, here is what you can do (I downloaded the dataset from scikit-learn):
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split training and test set
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
# 1-nearest neighbour
ypred = np.array([ytrain[np.argmin(np.sum((x - Xtrain)**2, axis=1))] for x in Xtest])
# Compute classification error
sum(ypred != ytest) / len(ytest)
Now, this is 1-nearest neighbour, it only looks at the closest point from training set. For k-nearest neighbour, you have to change it to this:
# k-nearest neighbour
k = 3
ypredk = np.array([np.argmax(np.bincount(ytrain[np.argsort(np.sum((x - Xtrain)**2, axis=1))[0:k]])) for x in Xtest])
sum(ypredk != ytest) / len(ytest)
To put it in words, you sort the distances, you find the indices of the k lowest values (that's the np.argsort part) and the corresponding labels, then you look for the most common label among the k of them (that's the np.argmax(np.bincount(x)) part).
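For instance, with a made-up set of neighbour labels for k = 3:

# Suppose the 3 nearest neighbours have these (made-up) integer labels
neighbour_labels = np.array([2, 0, 2])
counts = np.bincount(neighbour_labels)  # array([1, 0, 2]): one 0, zero 1s, two 2s
np.argmax(counts)                       # 2, the most common label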
Finally, if you want to make sure, you can compare with scikit-learn:
# scikit-learn NN
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=k, algorithm='ball_tree')
knn.fit(Xtrain, ytrain)
ypred_sklearn = knn.predict(Xtest)
sum(ypred_sklearn != ytest) / len(ytest)
Upvotes: 2