Reputation: 4149
I am writing a k-nearest-neighbor classifier in Python. I am running into an issue where the array operations take too long.
import math
import pdb

def classify(k, train_data, target):
    num_rows = train_data.shape[0]
    num_cols = train_data.shape[1]
    distances = []
    candidates = [0] * 10
    for i, row in enumerate(train_data):
        dist = euclidean_dist(target[:num_cols - 1], row[:num_cols - 1])  # slow
        distances.append((dist, row[num_cols - 1]))
    distances.sort(key=lambda tup: tup[0])
    distances = distances[:k]
    for i, d in enumerate(distances):
        candidates[d[1]] += 1
    return candidates.index(max(candidates))

def euclidean_dist(x1, x2):
    assert(len(x1) == len(x2))
    result = 0
    pdb.set_trace()
    for i in range(len(x1)):  # culprit: x1 and x2 are both 256-length lists
        result += math.pow(x1[i] - x2[i], 2)
    result = math.sqrt(result)
    return result
I added comments in the code above showing where the trouble occurs. Any suggestions for making this faster are welcome.
Upvotes: 3
Views: 125
Reputation: 10150
Looks like you just want the Euclidean distance / 2-norm, which you can get via numpy (imported as np) quite efficiently:
import numpy as np

def euclidean_dist2(x1, x2):
    assert(len(x1) == len(x2))
    x1 = np.array(x1)
    x2 = np.array(x2)
    norm = np.linalg.norm(x1 - x2)  # vectorized 2-norm of the difference
    return norm

print(euclidean_dist2([1, 2], [4, 7]))
This will give you 5.83095189485, the same result as your original function, but vectorized.
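To see the payoff at your problem size, here is a micro-benchmark sketch (not from the original post; timings vary by machine) comparing the two on 256-length inputs. The pdb.set_trace() call is removed from the loop version, since it would dominate the timing:

import timeit

setup = """
import math
import random
import numpy as np

x1 = [random.random() for _ in range(256)]
x2 = [random.random() for _ in range(256)]
a1 = np.array(x1)  # pre-converted arrays avoid per-call conversion cost
a2 = np.array(x2)

def euclidean_dist(x1, x2):
    result = 0
    for i in range(len(x1)):
        result += math.pow(x1[i] - x2[i], 2)
    return math.sqrt(result)
"""

# time the pure-Python loop vs. the vectorized norm
print(timeit.timeit('euclidean_dist(x1, x2)', setup=setup, number=10000))
print(timeit.timeit('np.linalg.norm(a1 - a2)', setup=setup, number=10000))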
Breaking it down, you're just taking the element-wise difference, multiplying the resulting vector by itself (to square each component), summing the squares, then taking the square root of the sum:
def euclidean_dist3(x1, x2):
    assert(len(x1) == len(x2))
    x1 = np.array(x1)
    x2 = np.array(x2)
    diff = x1 - x2
    squared = np.transpose(diff) * diff  # element-wise square (transpose is a no-op on 1-D arrays)
    summed = sum(squared)
    norm = np.sqrt(summed)
    return norm
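Stepping that through the example above makes each piece concrete:

# For euclidean_dist3([1, 2], [4, 7]):
# diff    = [1, 2] - [4, 7]  ->  array([-3, -5])
# squared = diff * diff      ->  array([ 9, 25])
# summed  = 9 + 25           ->  34
# norm    = sqrt(34)         ->  5.83095189485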
In other words, you're just taking the dot product of the difference vector with itself:
def euclidean_dist4(x1, x2):
    assert(len(x1) == len(x2))
    x1 = np.array(x1)
    x2 = np.array(x2)
    diff = x1 - x2
    dot = np.dot(diff, diff)  # diff . diff == sum of squared differences
    norm = np.sqrt(dot)
    return norm
Different ways to achieve the same thing.
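The bigger win in your classify function, though, is to drop the per-row Python loop entirely and let NumPy compute every distance at once. A minimal sketch (mine, not from the answer above), assuming train_data is a 2-D NumPy array whose last column holds an integer class label, as your indexing suggests:

def classify_vectorized(k, train_data, target):
    # distance from target's features to every row's features, in one shot
    dists = np.linalg.norm(train_data[:, :-1] - np.asarray(target)[:-1], axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # labels of the k nearest neighbors
    labels = train_data[nearest, -1].astype(int)
    # most common label among the neighbors
    return np.bincount(labels, minlength=10).argmax()

For large training sets, np.argpartition(dists, k)[:k] would avoid the full sort, but argsort keeps the sketch simple.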
Upvotes: 2