Reputation: 41
I am trying to test my KNN classifier against some data that I sourced from UCI's Machine Learning Repository. Whenever I run the classifier, I get the same KeyError:
train_set[i[-1]].append(i[:-1])
KeyError: NaN
I am not sure why this keeps happening because if I comment out the classifier and just print the first 10 lines or so, the data shows up fine with no corruption or duplication of any kind.
Here is what some of the code looks like:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import warnings
from math import sqrt
from collections import Counter
import pandas as pd
import random

style.use('fivethirtyeight')


def k_nearest_neighbors(data, predict, k=3):
    # data maps each class label to a list of feature vectors
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    # Majority vote among the k closest training points
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result


df = pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?', -99999, inplace=True)
# axis is passed by keyword; recent pandas versions reject it positionally
df.drop(['id'], axis=1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)

test_size = 0.2
train_set = {2: [], 4: []}
test_set = {2: [], 4: []}
train_data = full_data[:-int(test_size * len(full_data))]
test_data = full_data[-int(test_size * len(full_data)):]

# The last column of each row is the class label (2 or 4)
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy:', correct / total)
I am completely stumped as to why this KeyError keeps showing up. It also happens on the
test_set[i[-1]].append(i[:-1])
line.
I searched for people who had run into similar issues but found nobody with the same problem. As always, any assistance is greatly appreciated. Thank you.
Upvotes: 0
Views: 458
Reputation: 41
I figured out that the error was caused by a formatting issue in the data file. When I typed in the column names (the header row) after downloading the data, I forgot to put them on their own line. Instead, I typed them directly in front of the first data point, so the header and the first row were fused into a single line, and that caused the error.
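For anyone who hits the same thing, here is a minimal sketch of a check that should catch this kind of mistake before the classifier runs. It is not part of my original script: the helper name load_and_validate, the three-column miniature files, and the column names in them are made up for illustration. The idea is that when the header line is fused with the first data row, pandas sees more column names than there are values per row, so the trailing columns (including the one the script treats as the class label) are filled with NaN.

```python
import io

import pandas as pd


def load_and_validate(source, label_column="class", allowed_labels=(2, 4)):
    """Hypothetical helper: read a CSV and fail early if the labels look wrong."""
    df = pd.read_csv(source)
    # A fused header changes the column names, so the label column goes missing.
    if label_column not in df.columns:
        raise ValueError(f"Column {label_column!r} not found; header is {list(df.columns)}")
    labels = df[label_column]
    # NaN labels are what later surface as "KeyError: nan" in the dict lookup.
    if labels.isna().any():
        raise ValueError(f"{int(labels.isna().sum())} rows have a missing class label")
    # Any label outside the expected set would also break train_set[label].
    if not labels.isin(list(allowed_labels)).all():
        raise ValueError("Unexpected class labels found")
    return df


# Illustrative miniature files (assumed layout, not the real data set).
GOOD_FILE = "id,thickness,class\n1000025,5,2\n1002945,5,2\n1015425,3,4\n"
# Header typed directly in front of the first data row, as in my mistake.
BAD_FILE = "id,thickness,class1000025,5,2\n1002945,5,2\n1015425,3,4\n"

good_df = load_and_validate(io.StringIO(GOOD_FILE))

try:
    load_and_validate(io.StringIO(BAD_FILE))
    bad_file_rejected = False
except ValueError:
    bad_file_rejected = True
```

Printing list(df.columns) right after read_csv is an even quicker way to spot the problem, since the fused header shows up as extra, oddly named columns.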
Upvotes: 1