Fogarasi Norbert

Reputation: 672

Calculating euclidean distances with Python runs too slow

I read two datasets from files into NumPy arrays like this:

import numpy as np

def read_data(filename):
   data = np.empty(shape=[0, 65], dtype=int)
   with open(filename) as f:
       for line in f:
           data = np.vstack((data, np.array(list(map(int, line.split(','))), dtype=int)))
   return data

I use numpy to calculate the euclidean distance between two lists:

def euclidean_distance(x, z):
   return np.linalg.norm(x-z)

After this, I calculate the euclidean distances like this:

for data in testing_data:
   for data2 in training_data:
       dist = euclidean_distance(data, data2)

My problem is that this code runs very slowly: it takes about 10 minutes to finish. How can I improve this? What am I missing?
I have to use the distances in another algorithm, so speed is very important.

Upvotes: 1

Views: 1547

Answers (1)

Grr

Reputation: 16079

You could use sklearn.metrics.pairwise_distances, which lets you spread the work across all of your cores through its n_jobs parameter. The question "Parallel construction of a distance matrix" covers the same topic and has a good discussion of the differences between pdist, cdist, and pairwise_distances.

If I understand your example correctly, you want the distance between each sample in the training set and each sample in the testing set. To do that you could use:

from sklearn.metrics import pairwise_distances

dist = pairwise_distances(training_data, testing_data, n_jobs=-1)
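For intuition about why a single vectorized call beats the nested Python loop, here is a minimal NumPy-only sketch of the same all-pairs computation. It is not the scikit-learn implementation, just an illustration based on the identity ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z. The helper name pairwise_euclidean and the random arrays are hypothetical stand-ins for the question's data.

```python
import numpy as np

def pairwise_euclidean(a, b):
    """Return a (len(a), len(b)) matrix of Euclidean distances (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z, evaluated for every pair at once
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    # Floating-point rounding can leave tiny negative values; clip before the square root
    return np.sqrt(np.maximum(sq, 0.0))

# Hypothetical stand-ins for the question's arrays (65 integer columns each)
rng = np.random.default_rng(0)
training_data = rng.integers(0, 17, size=(200, 65))
testing_data = rng.integers(0, 17, size=(50, 65))

# dist[i, j] is the distance between training sample i and testing sample j
dist = pairwise_euclidean(training_data, testing_data)

# The nested-loop equivalent, kept only to check the result
loop = np.array([[np.linalg.norm(x - z) for z in testing_data] for x in training_data])
```

The rows follow the first argument and the columns the second, so swapping the arguments gives the transposed matrix. For large inputs the full matrix may not fit in memory, in which case computing it in row chunks is one option.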

Upvotes: 2
