Reputation: 491

Python Optimized Comparison Between List of Dict

I'm trying to see whether nodes reside within the volume of a sphere, and add the node id to a list. However, the efficiency of the algorithm is incredibly slow and I'm not sure how to improve it.

I have two lists. List A has the format [{'num': ID, 'x': VALUE, 'y': VALUE, 'z': VALUE] while List B has the format [{'x': VALUE, 'y': VALUE, 'z': VALUE, 'rad': VALUE}].

The size of both lists can run upwards of 100,000 items each.

My current code is posted below, but it's very inefficient.

    filteredList = []
    for i in range(len(sList)):

            minx = (sList[i]['x']) - (sList[i]['radius'])
            maxx = (sList[i]['x']) + (sList[i]['radius'])
            miny = (sList[i]['y']) - (sList[i]['radius'])
            maxy = (sList[i]['y']) + (sList[i]['radius'])
            minz = (sList[i]['z']) - (sList[i]['radius'])
            maxz = (sList[i]['z']) + (sList[i]['radius'])

            for j in range(len(nList)):
                    if minx <= nList[j]['x'] <= maxx:
                            if miny <= nList[j]['y'] <= maxy:
                                    if minz <= nList[j]['z'] <= maxz:
                                            tmpRad = findRadius(sList[i],nList[j])
                                            if tmpRad <= sList[i]['radius']:
                                                    filteredList.append(int(nList[j]['num']))

I'm at a loss and appreciate any ideas.

Edit: Adding extra information about the data format.

List A (nList) -- defines nodes, with locations x,y,z, and identifier num
[{'y': 0.0, 'x': 0.0, 'num': 1.0, 'z': 0.0},
{'y': 0.0, 'x': 1.0, 'num': 2.0, 'z': 0.0},
{'y': 0.0, 'x': 2.0, 'num': 3.0, 'z': 0.0},
{'y': 0.0, 'x': 3.0, 'num': 4.0, 'z': 0.0},
{'y': 0.0, 'x': 4.0, 'num': 5.0, 'z': 0.0},
{'y': 0.0, 'x': 5.0, 'num': 6.0, 'z': 0.0},
{'y': 0.0, 'x': 6.0, 'num': 7.0, 'z': 0.0},
{'y': 0.0, 'x': 7.0, 'num': 8.0, 'z': 0.0},
{'y': 0.0, 'x': 8.0, 'num': 9.0, 'z': 0.0},
{'y': 0.0, 'x': 9.0, 'num': 10.0, 'z': 0.0}]

List B (sList) -- defines spheres using x,y,z, radius
[{'y': 18.0, 'x': 25.0, 'z': 26.0, 'radius': 0.0056470000000000001},
{'y': 29.0, 'x': 23.0, 'z': 45.0, 'radius': 0.0066280000000000002},
{'y': 46.0, 'x': 29.0, 'z': 13.0, 'radius': 0.014350999999999999},
{'y': 0.0, 'x': 20.0, 'z': 25.0, 'radius': 0.014866000000000001},
{'y': 27.0, 'x': 31.0, 'z': 18.0, 'radius': 0.018311999999999998},
{'y': 10.0, 'x': 36.0, 'z': 46.0, 'radius': 0.024702000000000002},
{'y': 27.0, 'x': 13.0, 'z': 48.0, 'radius': 0.027300999999999999},
{'y': 14.0, 'x': 1.0, 'z': 13.0, 'radius': 0.033889000000000002},
{'y': 31.0, 'x': 20.0, 'z': 11.0, 'radius': 0.034118999999999997},
{'y': 23.0, 'x': 28.0, 'z': 8.0, 'radius': 0.036683}]

Upvotes: -1

Answers (6)

Shuo

Reputation: 491

Taking in all this advice, I managed to come up with a solution that was about 50x faster than the original.

I realized that the bottleneck was in the datatype (list of dicts) I was using. Looping over multiple lists was incredibly slow in my cast and using sets was much more efficient.

First thing I did was to implement named tuples. I knew how my list of nodes was numbered which provided the hash I needed for efficiency.

def findNodesInSpheres(sList,nList,nx,ny,nz):
    print "Running findNodesInSpheres"
    filteredList = []
    for a in sList:
            rad = a.radius
            minx = (a.x) - (rad) if (a.x - rad > 0) else 0
            maxx = (a.x) + (rad) if (a.x + rad < nx ) else nx
            miny = (a.y) - (rad) if (a.y - rad > 0) else 0
            maxy = (a.y) + (rad) if (a.y + rad < ny ) else ny
            minz = (a.z) - (rad) if (a.z - rad > 0) else 0
            maxz = (a.z) + (rad) if (a.z + rad < nz ) else nz
            boundingBox = set([ (i + j * (nx + 1) + k * (nx + 1) * (ny + 1)) for i in range (int(minx),int(maxx)+1)
                            for j in range (int(miny),int(maxy)+1) for k in range(int(minz),int(maxz)+1) ])

            for b in sorted(boundingBox):
                    if findRadius(a,nList[b]) <= rad:
                            filteredList.append(nList[b].num)
    return filteredList

Using set() instead of list provided massive speedups. The larger the data set (nx, ny, nz), the more the speedup.

It could still be improved using tree implementation and domain decomposition as has been suggested, but for the moment it works.

Thanks to everyone for the advice!

Upvotes: 0

Marcelo Cantos

Reputation: 185842

(AFAICT, the following solution is algorithmically faster than any other answer posted so far: approximately O(N log N) vs O(N²). Caveat: this assumes that you don't have massive amounts of overlap between bounding boxes.)

If you are allowed to pre-compute an index structure:

Push all the min/max x values into a set and sort them, thus creating a list of vertical regions spanning the x-axis. Associate each region with the set of bounding boxes that contain it.
Repeat this procedure for min/max y values, to create a list of horizontal regions, and associate each region with the set of bounding boxes it contains.
For each point being tested:
- Use a binary chop to find the horizontal region that contains the point's x coordinate. What you really want, though, is the set of bounding boxes associated with the region.
- Likewise, find the set of bounding boxes associated with the y coordinate.
- Find the intersection of these two sets.
Test the bounding boxes in this residue set using Pythagoras.

Upvotes: 0

Chris Morgan

Reputation: 90742

(This answer deals with simple optimisations and Python style; it works with the existing algorithm, teaching some points of optimisation, rather than replacing it with a more efficient one.)

Here are some points to start with to make the code easier to read and understand:

Iterate over sList, not over range(len(sList)). for i in range(len(sList)) becomes for i in sList and sList[i] becomes i.
No need for that tmpRad; put it inline.
Instead of if a: if b: if c: use if a and b and c.

Now we're at this:

filteredList = []
for i in sList:
    minx = i['x'] - i['radius']
    maxx = i['x'] + i['radius']
    miny = i['y'] - i['radius']
    maxy = i['y'] + i['radius']
    minz = i['z'] - i['radius']
    maxz = i['z'] + i['radius']

    for j in nList:
        if minx <= j['x'] <= maxx and miny <= j['y'] <= maxy and minz <= j['z'] <= maxz and findRadius(i,j) <= i['radius']:
            filteredList.append(int(j['num']))

(PEP 8 would recommend splitting that long line to lines of no more than 80 characters; PEP 8 would also recommend filtered_list and s_list and n_list rather than filteredList, sList and nList.)

I've put the findRadius(i, j) <= i['radius'] first for style and because it looks like it might be more likely to evaluate to false, speeding up calculations. Then I've also inlined the minx etc. variables:

filteredList = []
for i in sList:
    for j in nList:
        if findRadius(i, j) <= i['radius'] \
        and i['x'] - i['radius'] <= j['x'] <= i['x'] + i['radius'] \
        and i['y'] - i['radius'] <= j['y'] <= i['y'] + i['radius'] \
        and i['z'] - i['radius'] <= j['z'] <= i['z'] + i['radius']:
            filteredList.append(int(j['num']))

One thing to think about is that i['x'] - i['radius'] <= j['x'] <= i['x'] + i['radius'] could be simplified; try things like subtracting i['x'] from all three parts.

You can shorten this even more with a list comprehension.

filteredList = [int(j['num']) for j in nList for i in sList
        if findRadius(i, j) <= i['radius']
        and i['x'] - i['radius'] <= j['x'] <= i['x'] + i['radius']
        and i['y'] - i['radius'] <= j['y'] <= i['y'] + i['radius']
        and i['z'] - i['radius'] <= j['z'] <= i['z'] + i['radius']]

And finally, named tuples (this has the side-effect of making them immutable, too, which is probably desired? Also note it's Python 2.6 only, read the page for how you could do it with older versions of Python):

from collections import namedtuple

node = namedtuple('node', 'x y z num')
sphere = namedtuple('sphere', 'x y z radius')

nList = [
        node(x=0.0, y=0.0, z=0.0, num=1.0),
        node(x=1.0, y=0.0, z=0.0, num=2.0),
        node(x=2.0, y=0.0, z=0.0, num=3.0),
        node(x=3.0, y=0.0, z=0.0, num=4.0),
        node(x=4.0, y=0.0, z=0.0, num=5.0),
        node(x=5.0, y=0.0, z=0.0, num=6.0),
        node(x=6.0, y=0.0, z=0.0, num=7.0),
        node(x=7.0, y=0.0, z=0.0, num=8.0),
        node(x=8.0, y=0.0, z=0.0, num=9.0),
        node(x=9.0, y=0.0, z=0.0, num=10.0)]

sList = [
        sphere(x=25.0, y=18.0, z=26.0, radius=0.0056470000000000001),
        sphere(x=23.0, y=29.0, z=45.0, radius=0.0066280000000000002),
        sphere(x=29.0, y=46.0, z=13.0, radius=0.014350999999999999),
        sphere(x=20.0, y=0.0, z=25.0, radius=0.014866000000000001),
        sphere(x=31.0, y=27.0, z=18.0, radius=0.018311999999999998),
        sphere(x=36.0, y=10.0, z=46.0, radius=0.024702000000000002),
        sphere(x=13.0, y=27.0, z=48.0, radius=0.027300999999999999),
        sphere(x=1.0, y=14.0, z=13.0, radius=0.033889000000000002),
        sphere(x=20.0, y=31.0, z=11.0, radius=0.034118999999999997),
        sphere(x=28.0, y=23.0, z=8.0, radius=0.036683)]

Then, instead of sphere['radius'] you can do sphere.radius. This makes the code neater:

filteredList = [int(j.num) for j in nList for i in sList
        if findRadius(i, j) <= i.radius
        and i.x - i.radius <= j.x <= i.x + i.radius
        and i.y - i.radius <= j.y <= i.y + i.radius
        and i.z - i.radius <= j.z <= i.z + i.radius]

Or, without the list comprehension,

filteredList = []
for i in sList:
    for j in nList:
        if findRadius(i, j) <= i.radius \
        and i.x - i.radius <= j.x <= i.x + i.radius \
        and i.y - i.radius <= j.y <= i.y + i.radius \
        and i.z - i.radius <= j.z <= i.z + i.radius:
            filteredList.append(int(j.num))

Finally, choose nicer names; [style changed slightly as per comments, putting findRadius at the end as it's more likely to be computationally expensive - you're the best judge of that, though]

filteredList = [int(n.num) for n in nodes for s in spheres
        if s.x - s.radius <= n.x <= s.x + s.radius and
            s.y - s.radius <= n.y <= s.y + s.radius and
            s.z - s.radius <= n.z <= s.z + s.radius and
            findRadius(s, n) <= s.radius]

Or,

filteredList = []
for s in spheres:
    for n in nodes:
        if (s.x - s.radius <= n.x <= s.x + s.radius and
            s.y - s.radius <= n.y <= s.y + s.radius and
            s.z - s.radius <= n.z <= s.z + s.radius and
            findRadius(s, n) <= s.radius):
            filteredList.append(int(n.num))

(You could put srad = s.radius in the outer loop for a probable slight performance gain if desired.)

Upvotes: 3

Kabie

Reputation: 10663

Actually, you could save all that by:

filteredList = [int(node['num']) for sphere in sList \
    for node in nList if findRadius(sphere,node)<=sphere['radius']]

If the distance from a point to a sphere's globe is less than the sphere's radius, then I guess we can say it is in the sphere, right?

I assume findRadius is defined like:

def findRadius(sphere,node):
    return ((node['x']-sphere['x'])**2 + \
            (node['y']-sphere['y'])**2 + \
            (node['z']-sphere['z'])**2)**.5

Upvotes: 0

Karl Knechtel

Reputation: 61498

First off, Python isn't built for that kind of iteration. Using indices to get at each element of a list is backwards, a kind of brain-damage that's taught by low-level languages where it's faster. In Python it's actually slower. range(len(whatever)) actually creates a new list of numbers, and then you work with the numbers that are handed to you from that list. What you really want to do is just work with objects that are handed to you from whatever.

While we're at it, we can pull out the common s['radius'] bit that is checked several times, and put all the if-checks for the bounding box on one line. Oh, and we don't need a separate 'tmpRad', and I assume the 'num's are already ints and don't need to be converted (if they do, why? Why not just have them converted ahead of time?)

None of this will make a huge difference, but it at least makes it easier to read, and definitely doesn't hurt.

filteredList = []
for s in sList:
  radius = s['radius']
  minx = s['x'] - radius
  maxx = s['x'] + radius
  miny = s['y'] - radius
  maxy = s['y'] + radius
  minz = s['z'] - radius
  maxz = s['z'] + radius

  for n in nList:
    if (minx <= n['x'] <= maxx) and (miny <= n['y'] <= maxy) and \
       (minz <= n['z'] <= maxz) and (findRadius(s, n) <= radius): 
      filteredList.append(n['num'])

Now it's at least clear what's going on.

However, for the scale of the problem we're working with, it sounds like we're going to need algorithmic improvements. What you probably want to do here is use some kind of BSP (binary space partitioning) technique. The way this works is:

First, we rearrange the nList into a tree. We cut it up into 8 smaller lists, based on whether x > 0, whether y > 0 and whether z > 0 for each point (8 combinations of the 3 boolean results). Then each of those gets cut into 8 again, using the same sort of criteria - e.g. if the possible range for x/y/z is -10..10, then we cut the "x > 0, y > 0, z > 0" list up according to whether x > 5, y > 5, z > 5, etc. You get the idea.
For each point in the sList, we check whether minx > 0, etc. The beautiful part: if minx > 0, we don't have to check any of the 'x < 0' lists, and if maxx < 0, we don't have to check any of the 'x > 0' lists. And so forth. We figure out which of the 8 "octants" of the space the bounding box intersects with; and for each of those, we recursively check the appropriate octants of those octants, etc. until we get to the leaves of the tree, and then we do the normal point-in-bounding-box, then point-in-sphere tests.

Upvotes: 1

Dan D.

Reputation: 74645

one we can remove from the sample

unless you need to iterate over a list by index, one shouldn't, also avoid using range, and merge ifs together

filteredList = []
for a in sList:

        minx = (a['x']) - (a['radius'])
        maxx = (a['x']) + (a['radius'])
        miny = (a['y']) - (a['radius'])
        maxy = (a['y']) + (a['radius'])
        minz = (a['z']) - (a['radius'])
        maxz = (a['z']) + (a['radius'])

        for b in nList:
                if minx <= b['x'] <= maxx and miny <= b['y'] <= maxy and minz <= b['z'] <= maxz:
                    tmpRad = findRadius(a,b)
                    if tmpRad <= a['radius']:
                        filteredList.append(int(b['num']))

Upvotes: 1

Python Optimized Comparison Between List of Dict

Answers (6)

Related Questions