Reputation: 672
I read two datasets from file into numpy arrays like this:
    import numpy as np

    def read_data(filename):
        data = np.empty(shape=[0, 65], dtype=int)
        with open(filename) as f:
            for line in f:
                data = np.vstack((data, np.array(list(map(int, line.split(','))), dtype=int)))
        return data
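As an aside on the reading step itself: growing the array with np.vstack inside the loop copies every accumulated row on each iteration, so the read alone is quadratic in the number of lines. A minimal linear-time sketch using np.loadtxt (assuming the file really is plain comma-separated integers, as read_data above implies; read_data_fast is a name chosen here for illustration):

```python
import numpy as np

def read_data_fast(filename):
    # np.loadtxt parses the whole CSV in one pass, avoiding the
    # repeated full-array copy that np.vstack performs per line.
    return np.loadtxt(filename, delimiter=',', dtype=int)
```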
I use numpy to calculate the Euclidean distance between two vectors:
    def euclidean_distance(x, z):
        return np.linalg.norm(x - z)
After this, I calculate the Euclidean distances like this:

    for data in testing_data:
        for data2 in training_data:
            dist = euclidean_distance(data, data2)
My problem is that this code runs very slowly; it takes about 10 minutes to finish. How can I improve this? What am I missing?
I have to use the distances in another algorithm, so speed is very important.
Upvotes: 1
Views: 1547
Reputation: 16079
You could use sklearn.metrics.pairwise_distances, which allows you to allocate the work to all of your cores. Parallel construction of a distance matrix discusses the same topic and provides a good discussion on the differences between pdist, cdist, and pairwise_distances.

If I understand your example correctly, you want the distance between each sample in the training set and each sample in the testing set. To do that you could use:
    dist = pairwise_distances(training_data, testing_data, n_jobs=-1)
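A self-contained sketch of that call (assuming scikit-learn is installed; the tiny sample arrays are made up for illustration):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

training_data = np.array([[0, 0], [1, 1]])
testing_data = np.array([[0, 1], [2, 2]])

# dist[i, j] is the Euclidean distance between training_data[i]
# and testing_data[j]; n_jobs=-1 uses all available cores.
dist = pairwise_distances(training_data, testing_data, n_jobs=-1)
print(dist.shape)  # one row per training sample, one column per testing sample
```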
Upvotes: 2