NumPy implementation of the Nearest Neighbor classification algorithm classifies everything the exact same way

My assignment is to use a K-Nearest Neighbor algorithm in NumPy to determine what kind of flower something is, based on various features of it (e.g. stem length, petal length, etc.). (For the record, I've worked with Python in the past, although it's not my "best" language; however, I'm completely new to NumPy.)

Both my training and my testing data are in CSVs that look like this:

4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor

I know how to do the basic algorithm. Here's the C# I created for it:

namespace Project_3_Prototype
{
    public class FourD
    {
        public double f1, f2, f3, f4;

        public string name;

        public static double Distance(FourD a, FourD b)
        {
            double squared = Math.Pow(a.f1 - b.f1, 2) + Math.Pow(a.f2 - b.f2, 2) + Math.Pow(a.f3 - b.f3, 2) + Math.Pow(a.f4 - b.f4, 2);

            return Math.Sqrt(squared);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            List<FourD> distances = new List<FourD>();

            using (var parser = new TextFieldParser("iris-training-data.csv"))
            {
                parser.SetDelimiters(",");

                while (!parser.EndOfData)
                {
                    string[] fields = parser.ReadFields();

                    var curr = new FourD
                    {
                        f1 = double.Parse(fields[0]),
                        f2 = double.Parse(fields[1]),
                        f3 = double.Parse(fields[2]),
                        f4 = double.Parse(fields[3]),
                        name = fields[4]
                    };

                    distances.Add(curr);
                }
            }

            double correct = 0, total = 0;

            using (var parser = new TextFieldParser("iris-testing-data.csv"))
            {
                parser.SetDelimiters(",");

                int i = 1;

                while (!parser.EndOfData)
                {
                    total++;
                    string[] fields = parser.ReadFields();

                    var curr = new FourD
                    {
                        f1 = double.Parse(fields[0]),
                        f2 = double.Parse(fields[1]),
                        f3 = double.Parse(fields[2]),
                        f4 = double.Parse(fields[3]),
                        name = fields[4]
                    };

                    FourD min = distances[0];

                    foreach (FourD comp in distances)
                    {
                        if (FourD.Distance(comp, curr) < FourD.Distance(min, curr))
                        {
                            min = comp;
                        }
                    }

                    if (min.name == curr.name)
                    {
                        correct++;
                    }

                    Console.WriteLine(string.Format("{0},{1},{2}", i, curr.name, min.name));

                    i++;
                }
            }

            Console.WriteLine("Accuracy: " + correct / total);

            Console.ReadLine();
        }
    }
}

This is working exactly as expected, with the following output:

# The format is Number,Correct label,Predicted Label
1,Iris-setosa,Iris-setosa
2,Iris-setosa,Iris-setosa
3,Iris-setosa,Iris-setosa
4,Iris-setosa,Iris-setosa
5,Iris-setosa,Iris-setosa
6,Iris-setosa,Iris-setosa
7,Iris-setosa,Iris-setosa
8,Iris-setosa,Iris-setosa
9,Iris-setosa,Iris-setosa
10,Iris-setosa,Iris-setosa
11,Iris-setosa,Iris-setosa
12,Iris-setosa,Iris-setosa
...

Accuracy: 0.946666666666667

I'm trying to do the same thing in NumPy. However, the assignment does not permit me to use for loops, only vectorized functions.

So, basically what I want to do is: for every row in the testing data, get the index of the row in the training data that's closest to it (i.e. has the minimum Euclidean distance).
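To make that concrete, here is a rough sketch of the behaviour I'm after, written with plain loops just to illustrate (the arrays are made up, and obviously the real solution isn't allowed to use loops like this):

import numpy as np

# Made-up arrays for illustration: each row is one flower's four features
training = np.array([[4.6, 3.6, 1.0, 0.2],
                     [7.0, 3.2, 4.7, 1.4]])
testing = np.array([[5.1, 3.3, 1.7, 0.5]])

nearest = []
for test_row in testing:
    # Euclidean distance from this test row to every training row
    dists = [np.sqrt(((test_row - train_row) ** 2).sum()) for train_row in training]
    # Index of the closest training row
    nearest.append(int(np.argmin(dists)))

print(nearest)  # [0] -- the first training row is closest to the test row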

Here's what I tried in Python:

import numpy as np

def main():    
    # Split each line of the CSV into a list of attributes and labels
    data = [x.split(',') for x in open("iris-training-data.csv")]

    # The last item is the label
    labels = np.array([x[-1].rstrip() for x in data])

    # Convert the first 3 items to a 2D array of floats
    floats = np.array([x[0:3] for x in data]).astype(float)

    classifyTrainingExamples(labels, floats)

def classifyTrainingExamples(labels, floats):
    # We're basically doing the same thing to the testing data that we did to the training data
    testingData = [x.split(',') for x in open("iris-testing-data.csv")]

    testingLabels = np.array([x[-1].rstrip() for x in testingData])

    testingFloats = np.array([x[0:3] for x in testingData]).astype(float)

    res = np.apply_along_axis(lambda x: closest(floats, x), 1, testingFloats)

    correct = 0

    for number, index in enumerate(res):    
        if labels[index] == testingLabels[number]:
            correct += 1

        print("{},{},{}".format(number + 1, testingLabels[number], labels[index]))

        number += 1

    print(correct / len(list(res)))

def closest(otherArray, item):
    res = np.apply_along_axis(lambda x: distance(x, item), 1, otherArray)

    i = np.argmin(res)

    return i

# Get the Euclidean distance between two "flat" lists (i.e. one particular row)
def distance(a, b):
    # Subtract one from the other elementwise, then raise each one to the power of 2
    lst = (a - b) ** 2

    # Sum all of the elements together, and take the square root
    result = np.sqrt(lst.sum())

    return result

main()

Unfortunately, the output looks like this:

1,Iris-setosa,Iris-setosa
2,Iris-setosa,Iris-setosa
3,Iris-setosa,Iris-setosa
4,Iris-setosa,Iris-setosa
....
74,Iris-setosa,Iris-setosa
75,Iris-setosa,Iris-setosa
0.93333333

Every single line has nothing but Iris-setosa for both labels, and the accuracy is 0.93333333.

I tried stepping through this with a debugger, and every item is counted as correct by the if statement (but the correctness percentage is still shown as 0.93333333).

So basically:

Can someone help me see what I'm missing here?

And before anyone asks, for the record, yes, I did try stepping through this with a debugger :) Also for the record, yes, this is homework.

Upvotes: 0

Views: 1298

Answers (1)

Abby

Reputation: 91

If you really want to do it in one line, here is what you can do (I downloaded the dataset from scikit-learn):

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Split training and test set
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
# 1-nearest neighbour
ypred = np.array([ytrain[np.argmin(np.sum((x-Xtrain)**2,axis=1))] for x in Xtest])
# Compute classification error
sum(ypred != ytest)/ len(ytest)

Now, this is 1-nearest neighbour; it only looks at the single closest point in the training set. For k-nearest neighbours, you have to change it to this:

# k-nearest neighbour
k = 3
ypredk = np.array([np.argmax(np.bincount(ytrain[np.argsort(np.sum((x-Xtrain)**2,axis=1))[0:k]])) for x in Xtest])
sum(ypredk != ytest)/ len(ytest)

To put it in words: you sort the distances and take the indices of the k lowest values (that's the np.argsort part), look up the corresponding training labels, and then pick the most common label among those k (that's the np.argmax(np.bincount(...)) part).
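For example, the voting step on its own, with a made-up array of integer labels for the three nearest neighbours, looks like this:

import numpy as np

# Made-up labels of the k nearest neighbours (0 = setosa, 1 = versicolor, 2 = virginica)
nearest_labels = np.array([1, 2, 1])

counts = np.bincount(nearest_labels)  # array([0, 2, 1]): label 1 occurs twice, label 2 once
prediction = np.argmax(counts)        # 1 -- the most common label wins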

Finally, if you want to make sure, you can compare with scikit-learn:

# scikit-learn NN
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=k, algorithm='ball_tree')
knn.fit(Xtrain,ytrain)
ypred_sklearn = knn.predict(Xtest)
sum(ypred_sklearn != ytest)/ len(ytest)
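One last note: these snippets print the error rate, while your C# program printed accuracy. If you want the same number as before, it is just the complement (continuing from the snippet above):

# Accuracy is 1 minus the error rate
accuracy = np.mean(ypred_sklearn == ytest)
print("Accuracy:", accuracy)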

Upvotes: 2
