Reputation: 545
Suppose I have a nearest neighbor classifier. For a new observation it computes the distance between the new observation and all observations in the "known" data set. It returns the class label of the observation that has the smallest distance to the new observation.
import numpy as np
known_obs = np.random.randint(0, 10, 40).reshape(8, 5)  # 8 known observations, 5 features each
new_obs = np.random.randint(0, 10, 80).reshape(16, 5)   # 16 new observations to classify
labels = np.random.randint(0, 2, 8)                     # one binary class label per known observation

def my_dist(x1, known_obs, axis=0):
    # squared Euclidean distance from x1 to every known observation
    return np.square(np.linalg.norm(x1 - known_obs, axis=axis))

def nn_classifier(n, known_obs, labels, axis=1, distance=my_dist):
    # label of the known observation closest to n
    return labels[np.argmin(distance(n, known_obs, axis=axis))]

def classify_batch(new_obs, known_obs, labels, classifier=nn_classifier, distance=my_dist):
    # classify each row of new_obs in a Python-level loop
    return [classifier(n, known_obs, labels, distance=distance) for n in new_obs]

print(classify_batch(new_obs, known_obs, labels, nn_classifier, my_dist))
For performance reasons I would like to avoid the for loop in the classify_batch function. Is there a way to use numpy operations to apply the nn_classifier function to each row of new_obs? I already tried np.apply_along_axis, but as is often mentioned, it is convenient rather than fast.
Upvotes: 0
Views: 2367
Reputation: 231395
The key to avoiding the loop is to express the action on the (16,8) array of 'distances'. The labels[] and argmin steps just cloud the issue.
If I set labels = np.arange(8), then this
# one row of squared distances per new observation: shape (16, 8)
arr = np.array([my_dist(n, known_obs, axis=1) for n in new_obs])
print(arr)
print(np.argmin(arr, axis=1))
produces the same result. It still has a list comprehension, but we are closer to the underlying computation.
[[ 32. 115. 22. 116. 162. 86. 161. 117.]
[ 106. 31. 142. 164. 92. 106. 45. 103.]
[ 44. 135. 94. 18. 94. 50. 87. 135.]
[ 11. 92. 57. 67. 79. 43. 118. 106.]
[ 40. 67. 126. 98. 50. 74. 75. 175.]
[ 78. 61. 120. 148. 102. 128. 67. 191.]
[ 51. 48. 57. 133. 125. 35. 110. 14.]
[ 47. 28. 93. 91. 63. 49. 32. 88.]
[ 61. 86. 23. 141. 159. 85. 146. 22.]
[ 131. 70. 155. 149. 129. 127. 44. 138.]
[ 97. 138. 87. 117. 223. 77. 130. 122.]
[ 151. 78. 211. 161. 131. 115. 46. 164.]
[ 13. 50. 31. 69. 59. 43. 80. 40.]
[ 131. 108. 157. 161. 207. 85. 102. 146.]
[ 39. 106. 67. 23. 61. 67. 70. 88.]
[ 54. 51. 74. 68. 42. 86. 35. 65.]]
[2 1 3 0 0 1 7 1 7 6 5 6 0 5 3 6]
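As a quick check (a sketch, reusing the definitions from the question): with labels = np.arange(8), nn_classifier returns the index of the nearest neighbor itself, so the loop-based result and the row-wise argmin must agree.
# with identity labels 0..7, the classifier output is just the nearest-neighbor index
idx_labels = np.arange(8)
assert classify_batch(new_obs, known_obs, idx_labels) == list(np.argmin(arr, axis=1))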
With print((new_obs[:,None,:] - known_obs[None,:,:]).shape) I get a (16,8,5) array. So can I apply linalg.norm on the last axis?
This seems to do the trick:
np.square(np.linalg.norm(diff, axis=-1))
So, putting it together:
diff = new_obs[:, None, :] - known_obs[None, :, :]  # (16, 8, 5) pairwise differences via broadcasting
dist = np.square(np.linalg.norm(diff, axis=-1))     # (16, 8) squared distances
idx = np.argmin(dist, axis=1)                       # index of the nearest known observation per row
print(idx)
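To get class labels rather than indices, index labels with idx; a minimal sketch, assuming the random labels from the question:
# mapping nearest-neighbor indices back to class labels reproduces the loop version
result = labels[idx]
print(result)
print(result.tolist() == classify_batch(new_obs, known_obs, labels))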
Upvotes: 1