Reputation: 4149
I am writing a k-nearest-neighbor classifier in Python. I am running into an issue where the array operations take too long.
import math
import pdb

def classify(k, train_data, target):
    num_rows = train_data.shape[0]
    num_cols = train_data.shape[1]
    distances = []
    candidates = [0] * 10
    for i, row in enumerate(train_data):
        dist = euclidean_dist(target[:num_cols - 1], row[:num_cols - 1])  # slow
        distances.append((dist, row[num_cols - 1]))
    distances.sort(key=lambda tup: tup[0])
    distances = distances[:k]
    for i, d in enumerate(distances):
        candidates[d[1]] += 1
    return candidates.index(max(candidates))

def euclidean_dist(x1, x2):
    assert(len(x1) == len(x2))
    result = 0
    pdb.set_trace()
    for i in range(len(x1)):  # culprit: x1 and x2 are both 256-length lists
        result += math.pow(x1[i] - x2[i], 2)
    result = math.sqrt(result)
    return result
I added comments in the code above showing where the trouble occurs. Any suggestions for making this faster are welcome.
Upvotes: 3
Views: 125
Reputation: 10150
Looks like you just want the Euclidean distance / 2-norm, which you can get via numpy (imported as np) quite efficiently:
import numpy as np

def euclidean_dist2(x1, x2):
    assert(len(x1) == len(x2))
    x1 = np.array(x1)
    x2 = np.array(x2)
    norm = np.linalg.norm(x1 - x2)  # vectorized 2-norm of the difference
    return norm

print(euclidean_dist2([1, 2], [4, 7]))
This will give you 5.83095189485, the same result as your original function, but vectorized.
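To see the payoff at your problem size, here is a micro-benchmark sketch (not from the original post; timings vary by machine) comparing the two on 256-length inputs. The pdb.set_trace() call is removed from the loop version, since it would dominate the timing:

import timeit

setup = """
import math
import random
import numpy as np

x1 = [random.random() for _ in range(256)]
x2 = [random.random() for _ in range(256)]
a1 = np.array(x1)  # pre-converted arrays avoid per-call conversion cost
a2 = np.array(x2)

def euclidean_dist(x1, x2):
    result = 0
    for i in range(len(x1)):
        result += math.pow(x1[i] - x2[i], 2)
    return math.sqrt(result)
"""

# time the pure-Python loop vs. the vectorized norm
print(timeit.timeit('euclidean_dist(x1, x2)', setup=setup, number=10000))
print(timeit.timeit('np.linalg.norm(a1 - a2)', setup=setup, number=10000))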
Breaking it down, you're just taking the element-wise difference, multiplying the resulting vector by itself (to square each component), summing the squares, then taking the square root of the sum:
def euclidean_dist3(x1, x2):
    assert(len(x1) == len(x2))
    x1 = np.array(x1)
    x2 = np.array(x2)
    diff = x1 - x2
    squared = np.transpose(diff) * diff  # element-wise square (transpose is a no-op on 1-D arrays)
    summed = sum(squared)
    norm = np.sqrt(summed)
    return norm
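Stepping that through the example above makes each piece concrete:

# For euclidean_dist3([1, 2], [4, 7]):
# diff    = [1, 2] - [4, 7]  ->  array([-3, -5])
# squared = diff * diff      ->  array([ 9, 25])
# summed  = 9 + 25           ->  34
# norm    = sqrt(34)         ->  5.83095189485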
In other words, you're just taking the dot product of the difference vector with itself:
def euclidean_dist4(x1, x2):
    assert(len(x1) == len(x2))
    x1 = np.array(x1)
    x2 = np.array(x2)
    diff = x1 - x2
    dot = np.dot(diff, diff)  # diff . diff == sum of squared differences
    norm = np.sqrt(dot)
    return norm
Different ways to achieve the same thing.
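The bigger win in your classify function, though, is to drop the per-row Python loop entirely and let NumPy compute every distance at once. A minimal sketch (mine, not from the answer above), assuming train_data is a 2-D NumPy array whose last column holds an integer class label, as your indexing suggests:

def classify_vectorized(k, train_data, target):
    # distance from target's features to every row's features, in one shot
    dists = np.linalg.norm(train_data[:, :-1] - np.asarray(target)[:-1], axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # labels of the k nearest neighbors
    labels = train_data[nearest, -1].astype(int)
    # most common label among the neighbors
    return np.bincount(labels, minlength=10).argmax()

For large training sets, np.argpartition(dists, k)[:k] would avoid the full sort, but argsort keeps the sketch simple.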
Upvotes: 2