Wesley Kitlasten

Reputation: 417

Fastest way to find nearest nonzero value in array from columns in pandas dataframe

I am looking for the nearest nonzero cell in a numpy 3d array based on the i,j,k coordinates stored in a pandas dataframe. My solution below works, but it is slower than I would like. I know my optimization skills are lacking, so I am hoping someone can help me find a faster option.

It takes 2 seconds to find the nearest non-zero for a 100 x 100 x 100 binary array, and I have hundreds of files, so any speed enhancements would be much appreciated!

import time
import numpy as np
import pandas as pd

# 100 x 100 x 100 binary array
a = np.random.randint(0, 2, size=(100, 100, 100))
# df with the i, j, k coordinates of interest
df = pd.DataFrame(np.random.randint(100, size=(100, 3)),
                  columns=['i', 'j', 'k'])

def find_nearest(a, df):
    t0 = time.time()
    # coordinates of every nonzero cell: a tuple of three 1-D arrays
    nzi = np.nonzero(a)
    for i, r in df.iterrows():
        # squared distance from this row's point to every nonzero cell
        dist = ((r['k'] - nzi[0])**2 +
                (r['i'] - nzi[1])**2 +
                (r['j'] - nzi[2])**2)
        # index of the nearest nonzero cell
        nidx = dist.argmin()
        df.loc[i, ['nk', 'ni', 'nj']] = (nzi[0][nidx],
                                         nzi[1][nidx],
                                         nzi[2][nidx])
    print(time.time() - t0)
    return df

Upvotes: 1

Views: 256

Answers (1)

Jérôme Richard

Reputation: 50826

The problem that you are trying to solve looks like a nearest-neighbor search.

The worst-case complexity of the current code is O(n m), with n the number of points to search and m the number of neighbour candidates. With n = 100 and m = 100**3 = 1,000,000, this means roughly a hundred million iterations. To solve this efficiently, one can use a better algorithm.

The common way to solve this kind of problem is to put all the candidate elements in a BSP-tree data structure (such as a quadtree or an octree). Such a data structure lets you locate the nearest elements to a given location in O(log(m)) time. As a result, the overall complexity of this method is O(n log(m))! SciPy already implements KD-trees.

Vectorization generally also helps to speed up the computation.
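As a point of comparison, the per-row loop in the question can also be removed entirely with NumPy broadcasting alone. This is only a sketch (the function name is illustrative): it builds the full n x m squared-distance matrix, so it trades memory for speed and only pays off while n*m stays modest.

```python
import numpy as np

def find_nearest_bruteforce(a, points):
    # (m, 3) array with the coordinates of every nonzero cell
    candidates = np.argwhere(a != 0)
    # points is an (n, 3) array of (k, i, j) query coordinates;
    # broadcasting yields an (n, m, 3) difference tensor
    diff = points[:, None, :] - candidates[None, :, :]
    # (n, m) matrix of squared distances
    dist2 = (diff ** 2).sum(axis=2)
    # for each query point, the row index of its nearest candidate
    nidx = dist2.argmin(axis=1)
    return candidates[nidx]
```

For the 100**3 case in the question there are about 500,000 nonzero cells, so the distance matrix is still manageable for n = 100 queries, but the KD-tree approach below scales far better.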

from scipy.spatial import KDTree
import numpy as np
import pandas as pd
import time

def find_nearest_fast(a, df):
    t0 = time.time()
    # (m, 3) array with the coordinates of every nonzero cell
    candidates = np.array(np.nonzero(a)).transpose().copy()
    # build the KD-tree once; each query then costs O(log(m))
    tree = KDTree(candidates, leafsize=1024, compact_nodes=False)
    # (n, 3) array of query points, in the same (k, i, j) order
    searched = np.array([df['k'], df['i'], df['j']]).transpose()
    distances, indices = tree.query(searched)
    nearestPoints = candidates[indices, :]
    df[['nk', 'ni', 'nj']] = nearestPoints
    print(time.time() - t0)
    return df

This implementation is 16 times faster on my machine. Note that the results can differ slightly, because several nonzero cells can be tied for nearest (at the same distance) to a given input point.
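A quick way to convince yourself that the two approaches agree (up to such ties) is to compare the minimal distances rather than the chosen indices. A small self-contained check, assuming SciPy is available:

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(20, 20, 20))
candidates = np.argwhere(a != 0)
queries = rng.integers(0, 20, size=(10, 3))

# distances returned by the KD-tree
tree = KDTree(candidates)
d_tree, _ = tree.query(queries)

# distances from a vectorized brute-force search
d2 = ((queries[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=2)
d_brute = np.sqrt(d2.min(axis=1))

# the nearest points may differ when several are tied,
# but the minimal distances themselves must match
assert np.allclose(d_tree, d_brute)
```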

Upvotes: 1
