Reputation: 35
I have two data files, each of them contain a big number of 3-dimensional points (file A stores approximately 50,000 points, file B stores approximately 500,000 points). My goal is to find for every point (a) in file A the point (b) in file B which has the smallest distance to (a). I store the points in two lists like this:
List A nodes:
(ID X Y Z)
[ ['478277', -107.0, 190.5674, 128.1634],
['478279', -107.0, 190.5674, 134.0172],
['478282', -107.0, 190.5674, 131.0903],
['478283', -107.0, 191.9798, 124.6807],
... ]
List B data:
(X Y Z Data)
[ [-28.102, 173.657, 229.744, 14.318],
[-28.265, 175.549, 227.824, 13.648],
[-27.695, 175.925, 227.133, 13.142],
...]
My first approach was to simply iterate through the first and second list with a nested loop and compute the distance between every points like this:
outfile = open(job[0] + '/' + output, 'wb');
dist_min = float(job[5]);
dist_max = float(job[6]);
dists = [];
for node in nodes:
shortest_distance = 1000.0;
shortest_data = 0.0;
for entry in data:
dist = math.sqrt((node[1] - entry[0])**2 + (node[2] - entry[1])**2 + (node[3] - entry[2])**2);
if (dist_min <= dist <= dist_max) and (dist < shortest_distance):
shortest_distance = dist;
shortest_data = entry[3];
outfile.write(node[0] + ', ' + str('%10.5f' % shortest_data + '\n'));
outfile.close();
I recognized that the amount of loops Python has to run is way too big (~25,000,000,000), so I had to fasten my code. I tried to first calculate all distances with list comprehensions but the code still is too slow:
p_x = [row[1] for row in nodes];
p_y = [row[2] for row in nodes];
p_z = [row[3] for row in nodes];
q_x = [row[0] for row in data];
q_y = [row[1] for row in data];
q_z = [row[2] for row in data];
dx = [[(px - qx) for px in p_x] for qx in q_x];
dy = [[(py - qy) for py in p_y] for qy in q_y];
dz = [[(pz - qz) for pz in p_z] for qz in q_z];
dx = [[dxxx * dxxx for dxxx in dxx] for dxx in dx];
dy = [[dyyy * dyyy for dyyy in dyy] for dyy in dy];
dz = [[dzzz * dzzz for dzzz in dzz] for dzz in dz];
D = [[(dx[i][j] + dy[i][j] + dz[i][j]) for j in range(len(dx[0]))] for i in range(len(dx))];
D = [[(DDD**(0.5)) for DDD in DD] for DD in D];
To be honest, at this point, I do not know which of the two approaches is better, anyway, none of the two possibilities seem feasible. I'm not even sure if it is possible to write a code which calculates all distances in an acceptable time. Is there even another way to solve my problem without calculating all distances?
Edit: I forgot to mention that I am running on Python 2.5.1 and am not allowed to install or add any new libraries...
Upvotes: 2
Views: 3481
Reputation: 1
take a look at this generic 3D data structure:
https://github.com/m4nh/skimap_ros
it has a very fast RadiusSearch feature just ready to be used. This solution (similar to Octree but faster) avoids to you to create the Regular Grid first (you don't have to fix MAX/MIN size along each axis) and you save a lot of memory
Upvotes: 0
Reputation: 35
Just in case someone is interrested in the solution:
I found a way to speed up the whole process by not calculating all distances:
I created a 3D-list, representing a grid in the given 3D space, divided in X, Y and Z in a given step size (e.g. (Max. - Min.) / 1,000). Then I iterated over every 3D point to put it into my grid. After that I iterated over the points of set A again, looking if there are points from B in the same cube, if not I would increase the search radius, so the process is looking in the adjacent 26 cubes for points. The radius is increasing until there is at least one point found. The resulting list is comparatively small and can be ordered in short time and the nearest point is found.
The processing time went down to a couple minutes and it is working fine.
p_x = [row[1] for row in nodes];
p_y = [row[2] for row in nodes];
p_z = [row[3] for row in nodes];
q_x = [row[0] for row in data];
q_y = [row[1] for row in data];
q_z = [row[2] for row in data];
min_x = min(p_x + q_x);
min_y = min(p_y + q_y);
min_z = min(p_z + q_z);
max_x = max(p_x + q_x);
max_y = max(p_y + q_y);
max_z = max(p_z + q_z);
max_n = max(max_x, max_y, max_z);
min_n = min(min_x, min_y, max_z);
gridcount = 1000;
step = (max_n - min_n) / gridcount;
ruler_x = [min_x + (i * step) for i in range(gridcount + 1)];
ruler_y = [min_y + (i * step) for i in range(gridcount + 1)];
ruler_z = [min_z + (i * step) for i in range(gridcount + 1)];
grid = [[[0 for i in range(gridcount)] for j in range(gridcount)] for k in range(gridcount)];
for node in nodes:
loc_x = self.abatemp_get_cell(node[1], ruler_x);
loc_y = self.abatemp_get_cell(node[2], ruler_y);
loc_z = self.abatemp_get_cell(node[3], ruler_z);
if grid[loc_x][loc_y][loc_z] is 0:
grid[loc_x][loc_y][loc_z] = [[node[1], node[2], node[3], node[0]]];
else:
grid[loc_x][loc_y][loc_z].append([node[1], node[2], node[3], node[0]]);
for entry in data:
loc_x = self.abatemp_get_cell(entry[0], ruler_x);
loc_y = self.abatemp_get_cell(entry[1], ruler_y);
loc_z = self.abatemp_get_cell(entry[2], ruler_z);
if grid[loc_x][loc_y][loc_z] is 0:
grid[loc_x][loc_y][loc_z] = [[entry[0], entry[1], entry[2], entry[3]]];
else:
grid[loc_x][loc_y][loc_z].append([entry[0], entry[1], entry[2], entry[3]]);
out = [];
outfile = open(job[0] + '/' + output, 'wb');
for node in nodes:
neighbours = [];
radius = -1;
loc_nx = self.abatemp_get_cell(node[1], ruler_x);
loc_ny = self.abatemp_get_cell(node[2], ruler_y);
loc_nz = self.abatemp_get_cell(node[3], ruler_z);
reloop = True;
while reloop:
if neighbours:
reloop = False;
radius += 1;
start_x = 0 if ((loc_nx - radius) < 0) else (loc_nx - radius);
start_y = 0 if ((loc_ny - radius) < 0) else (loc_ny - radius);
start_z = 0 if ((loc_nz - radius) < 0) else (loc_nz - radius);
end_x = (len(ruler_x) - 1) if ((loc_nx + radius + 1) > (len(ruler_x) - 1)) else (loc_nx + radius + 1);
end_y = (len(ruler_y) - 1) if ((loc_ny + radius + 1) > (len(ruler_y) - 1)) else (loc_ny + radius + 1);
end_z = (len(ruler_z) - 1) if ((loc_nz + radius + 1) > (len(ruler_z) - 1)) else (loc_nz + radius + 1);
for i in range(start_x, end_x):
for j in range(start_y, end_y):
for k in range(start_z, end_z):
if not grid[i][j][k] is 0:
for grid_entry in grid[i][j][k]:
if not isinstance(grid_entry[3], basestring):
neighbours.append(grid_entry);
dists = [];
for n in neighbours:
d = math.sqrt((node[1] - n[0])**2 + (node[2] - n[1])**2 + (node[3] - n[2])**2);
dists.append([d, n[3]]);
dists = sorted(dists);
outfile.write(node[0] + ', ' + str(dists[0][-1]) + '\n');
outfile.close();
Function to get the position of a point:
def abatemp_get_cell(self, n, ruler):
for i in range(len(ruler)):
if i >= len(ruler):
return False;
if ruler[i] <= n <= ruler[i + 1]:
return i;
The gridcount variable gives one the chance to fasten the process, with a small gridcount the process of sorting the points into the grid is very fast, but the lists of neighbours in the search loop gets bigger and more time is needed for this part of the process. With a big gridcount more time is needed at the beginning, however the loop runs faster.
The only issue I have now is the fact, that there are cases when the process found neighbours but there are other points, which are not yet found, but are closer to the point (see picture). So far I solved this issue by incrementing the search radius another time when there are already neigbours. And still then I have points which are closer but not in the neighbours list, although it's a very small amount (92 out of ~100,000). I could solve this problem by increment the radius two times after finding neighbours, but this solution seems not very smart. Maybe you guys have an idea...
This is the first working draft of the process, I think it will be possible to improve it even more, just to give you an idea of how it is working...
Upvotes: 1
Reputation: 123
It took me a bit of thought but at the end I think I found a solution for you.
Your problem is not in the code you wrote but in the algorithm it implements.
There is an algorithm called Dijkstra's algorithm and here is the gist of it: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm .
Now what you need to do is to use this algorithm in a clever way:
create a node S (stand for source). Now link edges from S to all the nodes in B group. After you done that you should link edges from each point b in B to each point a in A. You should set the cost of the links from the source to 0 and the other to the distance between 2 points (only in 3D). Now if we will use Dijkstra's algorithm the output we will get would be the cost to travel from S to each point in the graph (we are only interested in the distance to points in group A). So since the cost is 0 to each point b in B and S is only connected to points in B so the road to any point a in A must include a node in B (actually exactly one since the shortest distance between to points is a single line).
I am not sure if this will fasten your code but as far as I know, a way to solve this problem without calculating all distances does not exist and this algorithm is the best time complexity one could hope for.
Upvotes: 0