nhoeft
nhoeft

Reputation: 545

Fast way to apply function to each row of a numpy array

Suppose I have some nearest neighbor classifier. For a new observation it computes the distance between the new observation and all observations in the "known" data set. It returns the class label of the observation, that has the smallest distance to the new observation.

import numpy as np

known_obs = np.random.randint(0, 10, 40).reshape(8, 5)
new_obs = np.random.randint(0, 10, 80).reshape(16, 5)
labels = np.random.randint(0, 2, 8).reshape(8, )

def my_dist(x1, known_obs, axis=0):
    return (np.square(np.linalg.norm(x1 - known_obs, axis=axis)))

def nn_classifier(n, known_obs, labels, axis=1, distance=my_dist):
    return labels[np.argmin(distance(n, known_obs, axis=axis))]

def classify_batch(new_obs, known_obs, labels, classifier=nn_classifier, distance=my_dist):
    return [classifier(n, known_obs, labels, distance=distance) for n in new_obs]

print(classify_batch(new_obs, known_obs, labels, nn_classifier, my_dist))

For performance reasons I would like to avoid the for loop in the classify_batch function. Is there a way to use numpy operations to apply the nn_classifier function to each row of new_obs? I already tried apply_along_axis but as often mentioned it is convenient but not fast.

Upvotes: 0

Views: 2367

Answers (1)

hpaulj
hpaulj

Reputation: 231395

The key to avoiding the loop is to express the action on the (16,8) array of 'distances'. The labels[] and argmin steps just cloud the issue.

If I set labels = np.arange(8), then this

arr = np.array([my_dist(n, known_obs, axis=1) for n in new_obs])
print(arr)
print(np.argmin(arr, axis=1))

produces the same thing. It still has a list comprehension, but we are closer to 'source'.

[[  32.  115.   22.  116.  162.   86.  161.  117.]
 [ 106.   31.  142.  164.   92.  106.   45.  103.]
 [  44.  135.   94.   18.   94.   50.   87.  135.]
 [  11.   92.   57.   67.   79.   43.  118.  106.]
 [  40.   67.  126.   98.   50.   74.   75.  175.]
 [  78.   61.  120.  148.  102.  128.   67.  191.]
 [  51.   48.   57.  133.  125.   35.  110.   14.]
 [  47.   28.   93.   91.   63.   49.   32.   88.]
 [  61.   86.   23.  141.  159.   85.  146.   22.]
 [ 131.   70.  155.  149.  129.  127.   44.  138.]
 [  97.  138.   87.  117.  223.   77.  130.  122.]
 [ 151.   78.  211.  161.  131.  115.   46.  164.]
 [  13.   50.   31.   69.   59.   43.   80.   40.]
 [ 131.  108.  157.  161.  207.   85.  102.  146.]
 [  39.  106.   67.   23.   61.   67.   70.   88.]
 [  54.   51.   74.   68.   42.   86.   35.   65.]]
[2 1 3 0 0 1 7 1 7 6 5 6 0 5 3 6]

With

print((new_obs[:,None,:] - known_obs[None,:,:]).shape)

I get a (16,8,5) array. So can I apply the linalg.norm on the last axis?

This seems to do the trick

np.square(np.linalg.norm(diff, axis=-1))

So together:

diff = (new_obs[:,None,:] - known_obs[None,:,:])
dist = np.square(np.linalg.norm(diff, axis=-1))
idx = np.argmin(dist, axis=1)
print(idx)

Upvotes: 1

Related Questions