mrQWERTY
mrQWERTY

Reputation: 4149

How to vectorize array operations in Python

I am writing a k nearest neighbor classifier in python. I am running into an issue where array operations take too long.

def classify(k, train_data, target):
    num_rows = train_data.shape[0]
    num_cols = train_data.shape[1]
    distances = []
    candidates = [0] * 10

    for i, row in enumerate(train_data):
        dist = euclidean_dist(target[:num_cols - 1], row[:num_cols - 1]) #slow
        distances.append((dist, row[num_cols - 1]))

    distances.sort(key=lambda tup: tup[0])
    distances = distances[:k]

    for i, d in enumerate(distances):
        candidates[d[1]] += 1

    return candidates.index(max(candidates))

def euclidean_dist(x1, x2):
    assert(len(x1) == len(x2))
    result = 0

    pdb.set_trace()
    for i in range(len(x1)): #culprit, x1 and x2 are both 256 length lists
        result += math.pow(x1[i] - x2[i], 2)
    result = math.sqrt(result)

    return result

I commented in the code above showing where the trouble occurs. Any suggestions to make this faster are welcome.

Upvotes: 3

Views: 125

Answers (1)

Simon
Simon

Reputation: 10150

Looks like you just want euclidean distance / 2norm, which you can get via numpy (imported as np) quite efficiently:

def euclidean_dist2(x1, x2):
    assert(len(x1) == len(x2))

    x1 = np.array(x1)
    x2 = np.array(x2)

    norm = np.linalg.norm(x1-x2)

    return norm

print euclidean_dist2([1,2],[4,7])

Which will give you 5.83095189485, the same as the function you had, but vectorized

Breaking it down, you're just taking the element wise difference, multiplying the resulting vector by itself (to square it), summing, then rooting the sum:

def euclidean_dist3(x1, x2):
    assert(len(x1) == len(x2))

    x1 = np.array(x1)
    x2 = np.array(x2)

    diff = x1 - x2

    squared = np.transpose(diff) * diff

    summed = sum(squared)

    norm = np.sqrt(summed)

    return norm

In other words, you're just taking the dot product of the difference vector with itself:

def euclidean_dist4(x1, x2):
    assert(len(x1) == len(x2))

    x1 = np.array(x1)
    x2 = np.array(x2)

    diff = x1 - x2

    dot = np.dot(diff, diff)

    norm = np.sqrt(dot)

    return norm

Different ways to achieve the same thing

Upvotes: 2

Related Questions