Bayes
Bayes

Reputation: 45

Efficient algorithm to calculate pairwise distances for a large set of locations in matlab

I recently got a problem when calculating the spherical distance with large set of locations in matlab. Here is the problem:

1.locations are latitudes and longitudes, stored in a vector locs with dimension nx2 where n = 10^5.

2.I want to calculate the spherical distance between any two locations in the vector locs. Note that the resulted distance matrix may or may not sparse, since I cannot check it without distance. Let's assume that it's sparse, because we can truncate the distance with a number, say 1000. (The units are km), and the resulted distance matrix is enough for my usage.

3.I tried two methods to calculate it, but each of them has drawbacks.

PS: I run the code on Mac OS with MATLAB 2014 student version. The function great_circle_distance() is pretty like Matlab built-in function calculating the great circle distance between any two locations with geospatial coordinates. The problem here is not relevant to usage of this function.

Thanks for all kinds of suggestions in advance.

--Disc

Method1:

dist = zeros(n, 1);
dist_vec = [];
dist_id = [];
dist_id2 = [];
for i =1:n
dist = great_circle_distance(locs(i, :), locs);
dist_temp=(dist<300) &(dist>0);   % for example, consider the distance between 0 and 300
[dist_id_temp, dist_id2_temp,dist_vtemp]=find(dist_temp);
dist_vec_temp=dist(dist_id_temp);
dist_id=[dist_id; dist_id_temp];
dist_id2=[dist_id2; dist_id2_temp];
dist_vec=[dist_vec; dist_vec_temp];

if mod(i, 1000) == 0

    i

end
end
dist_mat = sparse(dist_id, dist_id2, dist_vec);

The drawback here is that it took long time to finish this, at least 48 hours, but it consumed not too much memory around 2-5G.

Method2:

d = pdist(locs, @great_circle_distance);
[i, j] = find(tril(true(n), -1)); % extract index below main diagonal
d = d';
a = [i, j, d];
con = d == 0;   % specify condition. 
a(con, :) = [];   % delete rows satisfying condition con.
clear i j d;
i = a(:, 1);
j = a(:, 2);
d = a(:, 3);
dist_mat = sparse(i, j, d);

The drawback of this method is that it consumed too much memory (exceeding 15G) when calculating d in the code.

Upvotes: 1

Views: 2612

Answers (1)

Amro
Amro

Reputation: 124563

1) Method #1

You are being very inefficient in the way you build the intermediate variables (arrays changing size and such). Consider this improved implementation:

% given some data (longitude/latitude)
locs = rand(N,2);    % N = 10^5 in your case

% pick an appropriate size (guesstimate the number of nonzeros in the sparse mat)
nzmx = ...;

% compute pairwise-distance matrix (only those satisfying some condition)
D = spalloc(N, N, nzmx);
for i=1:N
    d = great_circle_distance(locs(i,:), locs);
    idx = find(100<d & d<400);   % distances in the range [100,400] km
    D(idx,i) = d(idx);
end

Note that for N=10^5 entries, the full distance matrix of type double would consume: N*N*8 bytes, or approximately ((10^5)^2*8)/2^30 = 74.5 GB of memory (that's gigabytes)! So obviously I'm assuming you have enough RAM to hold the sparse matrix even after applying your condition, otherwise the computation cannot be done outright (in that case you'll have to break it into chunks, and compute one piece at-a-time). Say only 5% of the matrix is non-zero, that would be like 4GB of contiguous memory. So you have to choose an appropriate value for nzmx in the above code.

You mentioned that the distance function great_circle_distance is not relevant to the discussion here. Just be sure that it is properly vectorized, otherwise it could affect performance severely..

Update: Here is one possible improvement by only filling the lower half of the sparse matrix:

D = spalloc(N, N, nzmx);
for i=1:N-1
    d = great_circle_distance(locs(i,:), locs(i+1:end,:));
    idx = find(100<d & d<400);
    if isempty(idx), continue; end
    D(idx+i,i) = d(idx);
end

2) Method #2

Obviously since you're using pdist, it will compute the whole matrix (well only half of it). So you gotta make sure you have enough memory to store it (about ((N*(N-1)/2)*8)/2^30 = 37.25 GB according to my calculations).

Upvotes: 1

Related Questions