Most efficient way to iterate over large vector?

Question

I have an input ndarray, trainPoints, with shape (4000000, 1), so pointsCount = 4000000. I have another ndarray, clusters, with shape (2, 1), so n_clusters = 2. I then want to compute the following:

distances = np.zeros((pointsCount, n_clusters))
for x in range(len(trainPoints)):
    for c in range(len(clusters)):
        distances[x,c] = (trainPoints[x]-clusters[c]).T@(trainPoints[x]-clusters[c])

However, this takes ages to complete. The same is true for the list comprehension distances = np.array([(x-cluster).T@(x-cluster) for x in trainPoints for cluster in clusters]).reshape((4000000, 2)).

Any way that I can perform this faster using numpy?

Sayandip Dutta · Accepted Answer

All you need to do is transpose clusters. For example, given these initial arrays (here pointsCount is the array of points, not its length):

>>> pointsCount    # 4 rows here instead of 4 million
array([[2],
       [4],
       [7],
       [6]])
>>> clusters
array([[2],
       [3]])
# Your code:
>>> np.array([(x-cluster).T@(x-cluster) for x in pointsCount for cluster in clusters]).reshape((4, 2))
array([[ 0,  1],
       [ 4,  1],
       [25, 16],
       [16,  9]])

# Faster code:
>>> (pointsCount - clusters.T)**2
array([[ 0,  1],
       [ 4,  1],
       [25, 16],
       [16,  9]], dtype=int32)

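This works because of NumPy broadcasting: subtracting clusters.T, with shape (1, 2), from pointsCount, with shape (4, 1), produces a (4, 2) array holding every point-minus-cluster difference. Since each point is a single number, (x - c).T @ (x - c) is just the square of that difference, so squaring elementwise gives the same distances without a Python loop.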
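If your points ever have more than one column, the one-liner above no longer sums over the features. Here is a minimal sketch of how the same broadcasting idea might be extended, assuming points of shape (n, d) and clusters of shape (k, d). The function name squared_distances and the variable names are my own, not from the question, and the check against the double loop uses the small 4-row example rather than the real data.

```python
import numpy as np


def squared_distances(points, clusters):
    """Squared Euclidean distance from every point to every cluster.

    points:   array of shape (n, d)
    clusters: array of shape (k, d)
    returns:  array of shape (n, k)

    Hypothetical helper for illustration; not part of NumPy.
    """
    # (n, 1, d) - (1, k, d) broadcasts to (n, k, d): every pairwise difference.
    diff = points[:, None, :] - clusters[None, :, :]
    # Summing the squares over the feature axis equals (x - c).T @ (x - c).
    return (diff ** 2).sum(axis=-1)


points = np.array([[2], [4], [7], [6]])
clusters = np.array([[2], [3]])

fast = squared_distances(points, clusters)

# Reference result: the explicit double loop from the question.
slow = np.zeros((len(points), len(clusters)))
for i in range(len(points)):
    for j in range(len(clusters)):
        delta = points[i] - clusters[j]
        slow[i, j] = delta.T @ delta

assert np.array_equal(fast, slow)
# With a single column (d == 1) this matches the one-liner from the answer.
assert np.array_equal(fast, (points - clusters.T) ** 2)
```

Note that the intermediate array has shape (n, k, d), so memory grows with all three sizes. For 4 million points, 2 clusters and 1 column that is small, but for many clusters or features you may need to process the points in chunks.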
