Sebastian
Sebastian

Reputation: 1438

Slow Row-wise Comparison with For-loops in NumPy - How to improve?

I'm using python and numpy to compare two arrays or equal shape with coordinates (x,y,z) in order to match them, which look like that:

coordsCFS
array([[ 0.02      ,  0.02      ,  0.        ],
       [ 0.03      ,  0.02      ,  0.        ],
       [ 0.02      ,  0.025     ,  0.        ],
        ..., 
       [ 0.02958333,  0.029375  ,  0.        ],
       [ 0.02958333,  0.0290625 ,  0.        ],
       [ 0.02958333,  0.0296875 ,  0.        ]])

and

coordsRMED
array([[ 0.02      ,  0.02      ,  0.        ],
       [ 0.02083333,  0.02      ,  0.        ],
       [ 0.02083333,  0.020625  ,  0.        ],
       ..., 
       [ 0.03      ,  0.0296875 ,  0.        ],
       [ 0.02958333,  0.03      ,  0.        ],
       [ 0.02958333,  0.0296875 ,  0.        ]]) 

The data are read from two hdf5 files with h5py. For the comparison I use allclose, which tests for "almost equality". The coordinates do not match within python's regular floating point precision. This is the reason I used the for loops, otherwise it would have worked with numpy.where. I usually try to avoid for loops, but in this context I couldn't figure out how. So I came up with this surprisingly slow snippet:

mapList = []
for cfsXYZ in coordsCFS:
    # print cfsXYZ
    indexMatch = 0
    match = []
    for asterXYZ in coordRMED:
        if numpy.allclose(asterXYZ,cfsXYZ):
            match.append(indexMatch)
            # print "Found match at index " + str(indexMatch)
            # print asterXYZ
        indexMatch += 1

    # check: must only find one match. 
    if len(match) != 1:
        print "ERROR matching"
        print match
        print cfsXYZ
        return 1

    # save to list
    mapList.append(match[0])

if len(mapList) != coordsRMED.shape[0]:
    print "ERROR: matching consistency check"
    print mapList
    return 1

This is very slow for my test sample size (800 rows). I plan to compare much larger sets. I could remove the consistency check and use break in the inner for loop for some speed benefit. Is there still a better way?

Upvotes: 0

Views: 784

Answers (3)

Janne Karila
Janne Karila

Reputation: 25207

You could get rid of the inner loop with something like this:

for cfsXYZ in coordsCFS:
    match = numpy.nonzero(
        numpy.max(numpy.abs(coordRMED - cfsXYZ), axis=1) < TOLERANCE)

Upvotes: 1

Pierre GM
Pierre GM

Reputation: 20339

A first thing to remember is that by default, in NumPy, "the iteration always proceeds in a C-style contiguous fashion (last index varying the fastest)"[1]. You might improve things by reversing the order of iteration (iterate on coordMED.T, the transpose of coordMED...)

Nevertheless, I'm still surprised by you need for a loop: you claim that 'The coordinates do not match within python's regular floating point precision': have you tried to adjust the rtol and atol parameters of np.allclose, as described in its doc?

[1]

Upvotes: 1

nneonneo
nneonneo

Reputation: 179592

One solution is to sort both arrays (adding an index column so that the sorted arrays still contains the original indices). Then, to match, step through the arrays in lock-step. Since you're expecting a precise 1-1 correspondence, you should always be able to match pairs of rows off.

Upvotes: 1

Related Questions