Kristof

Reputation: 174

How to get the distance to the closest previous finite number in a row using Numpy

I'm stuck on something that I think could be solved in a couple of lines with NumPy, but I just don't see it. Let's define an example array containing some missing values:

import numpy as np
input_data = np.array([[1,3,5,8,6],[3,np.nan,np.nan,5,6],[np.nan,6,7,np.nan,2]])

Out[530]: [[1, 3, 5, 8, 6], [3, nan, nan, 5, 6], [nan, 6, 7, nan, 2]]

What I want is an array that gives, for each element, the distance to the previous valid (non-NaN) value in the same row. For the example above, this would be something like:

delta_valid = [[nan, 1, 1, 1, 1], [nan, 1, 2, 3, 1], [nan, nan, 1, 1, 2]]

The first element in each row would always be NaN because there is no previous value (not sure if there's a better way to define this).

Can anyone help me get this result in NumPy? Thank you very much!

Upvotes: 2

Views: 346

Answers (2)

Divakar

Reputation: 221624

You are basically making ranges of (1,2,3,...) that restart after each non-NaN value. To solve such cases, we could use some diff + cumsum magic on each row, as shown below -

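def closest_distance_per_row(a):
    # Per-position increments; their row-wise cumsum yields the distances
    m0 = np.ones(a.shape,dtype=int)
    mask = ~np.isnan(a)
    for i,item in enumerate(a):
        idx = np.flatnonzero(mask[i])   # column indices of valid values
        if len(idx)>0:
            # Nothing to count before the first valid value
            m0[i,:idx[0]] = 0
            # At each later valid value, reset the running sum back to 1
            m0[i,idx[1:]] = idx[:-1] - idx[1:] +1

    out = np.full(a.shape,np.nan,dtype=float)
    # Shift right by one column: a distance depends only on what precedes the element
    out[:,1:] = m0[:,:-1].cumsum(1)
    out[out==0] = np.nan          # no valid value to the left
    out[~mask.any(1)] = np.nan    # rows that are entirely NaN
    return out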
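
To see why the cumsum produces the distances, it may help to trace a single row by hand. The snippet below is only an illustrative sketch: it repeats the per-row steps of the function on the second row of the question's example, and the names `steps` and `out` are picked here for the illustration.

```python
import numpy as np

# Second row of the question's example: valid, NaN, NaN, valid, valid
row = np.array([3, np.nan, np.nan, 5, 6])
idx = np.flatnonzero(~np.isnan(row))      # positions of valid values: 0, 3, 4

steps = np.ones(row.size, dtype=int)      # default: the distance grows by 1 per column
steps[:idx[0]] = 0                        # nothing to count before the first valid value
steps[idx[1:]] = idx[:-1] - idx[1:] + 1   # correction that resets the running sum to 1

running = steps.cumsum()                  # distance seen by the *next* column

out = np.full(row.size, np.nan)
out[1:] = running[:-1]                    # shift right by one column

print(steps.tolist())    # [1, 1, 1, -2, 0]
print(running.tolist())  # [1, 2, 3, 1, 1]
print(out)               # nan, 1, 2, 3, 1
```

The negative entry at position 3 cancels the count accumulated over the NaN gap, so the running sum falls back to 1 right after each valid value.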

Sample runs -

In [353]: a
Out[353]: 
array([[  1.,   3.,   5.,   8.,   6.],
       [  3.,  nan,  nan,   5.,   6.],
       [ nan,   6.,   7.,  nan,   2.]])

In [354]: closest_distance_per_row(a)
Out[354]: 
array([[ nan,   1.,   1.,   1.,   1.],
       [ nan,   1.,   2.,   3.,   1.],
       [ nan,  nan,   1.,   1.,   2.]])

In [343]: a
Out[343]: 
array([[ nan,  nan,  nan,  nan,  nan,  nan,   4.,  nan,   3.,   1.],
       [ nan,  nan,   6.,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
       [  0.,  nan,   2.,  nan,   1.,  nan,   0.,  nan,  nan,  nan],
       [  3.,  nan,   2.,  nan,   8.,   6.,  nan,   4.,   2.,  nan],
       [ nan,   0.,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,   2.,  nan,   0.,  nan,  nan,   1.,  nan,  nan]])

In [344]: closest_distance_per_row(a)
Out[344]: 
array([[ nan,  nan,  nan,  nan,  nan,  nan,  nan,   1.,   2.,   1.],
       [ nan,  nan,  nan,   1.,   2.,   3.,   4.,   5.,   6.,   7.],
       [ nan,   1.,   2.,   1.,   2.,   1.,   2.,   1.,   2.,   3.],
       [ nan,   1.,   2.,   1.,   2.,   1.,   1.,   2.,   1.,   1.],
       [ nan,  nan,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.],
       [ nan,  nan,  nan,   1.,   2.,   1.,   2.,   3.,   1.,   2.]])

Runtime test -

In [4]: a = np.random.randint(0,9,(5000,5000)).astype(float)

In [5]: a.ravel()[np.random.choice(a.size, int(a.size*0.5), replace=0)] = np.nan

In [6]: %timeit two_loops(a)
1 loops, best of 3: 16.7 s per loop

In [7]: %timeit closest_distance_per_row(a)
1 loops, best of 3: 339 ms per loop

In [8]: 16700/339.0 # Speedup with one loop (proposed in this post) over two loops
Out[8]: 49.26253687315634
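
For completeness, the remaining row loop could probably be removed as well. The following is a sketch of a different technique from the one above (a running maximum over column indices via `np.maximum.accumulate` instead of diff + cumsum); the function name `closest_distance_no_loop` is made up here, and it was not part of the timings above.

```python
import numpy as np

def closest_distance_no_loop(a):
    # Assumes a 2-D float array with NaN marking missing values
    a = np.asarray(a, dtype=float)
    cols = np.arange(a.shape[1])
    # Column index where the value is valid, -1 where it is NaN
    valid_idx = np.where(~np.isnan(a), cols, -1)
    # Running maximum = index of the latest valid value seen so far in each row
    last_valid = np.maximum.accumulate(valid_idx, axis=1)
    # Latest valid index strictly to the left of columns 1..n-1
    prev = last_valid[:, :-1]
    dist = (cols[1:] - prev).astype(float)
    dist[prev < 0] = np.nan               # no valid value to the left
    out = np.full(a.shape, np.nan)        # first column stays NaN
    out[:, 1:] = dist
    return out

a = np.array([[1, 3, 5, 8, 6],
              [3, np.nan, np.nan, 5, 6],
              [np.nan, 6, 7, np.nan, 2]])
result = closest_distance_no_loop(a)
print(result)
```

On the question's example this gives the same array as `closest_distance_per_row`; whether it is actually faster would need to be measured.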

Upvotes: 1

JohanL

Reputation: 6891

Here is a solution to your problem. It might not be optimal, as it might be possible to do something fancier with map and/or list comprehensions, but at least it solves your immediate issue:

import numpy as np
input_data = np.array([[1,3,5,8,6],[3,np.nan,np.nan,5,6],[np.nan,6,7,np.nan,2]])

def distance(vector):
    dist = np.nan
    dists = []
    for a in vector:
        dists.append(dist)
        dist = dist + 1 if np.isnan(a) else 1
    return np.array(dists)

dists = np.empty(input_data.shape)
for row_num, row in enumerate(input_data):
    dists[row_num, :] = distance(row)

It also only works for 2d arrays currently, but it could probably be generalized pretty easily.
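
One possible way to generalize, sketched under the assumption that the distance should always be measured along a single axis: hand the 1-D function to `np.apply_along_axis`. The wrapper name `distance_along_axis` is hypothetical, and `distance` is repeated so the snippet runs on its own.

```python
import numpy as np

def distance(vector):
    # Running counter: stays NaN until the first valid value has been seen
    dist = np.nan
    dists = []
    for a in vector:
        dists.append(dist)
        dist = dist + 1 if np.isnan(a) else 1
    return np.array(dists)

def distance_along_axis(data, axis=-1):
    # Hypothetical wrapper: applies distance() to every 1-D slice along `axis`
    return np.apply_along_axis(distance, axis, np.asarray(data, dtype=float))

# A small 3-D example; distances are computed along the last axis
cube = np.array([[[1, np.nan, 2], [np.nan, 4, np.nan]],
                 [[np.nan, np.nan, 7], [8, 9, np.nan]]])
result = distance_along_axis(cube)
print(result)
```

This still runs a Python loop per slice, so it is a convenience rather than a speedup.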

Also, the above piece of code is not very optimized. To make a fairer comparison with the accepted answer, here is a more optimized version that avoids the extra function calls and list building:

def two_loops(input_data):
    dists = np.empty(input_data.shape)
    for row_num, row in enumerate(input_data):
        dist = np.nan
        for col_num, value in enumerate(row):
            dists[row_num, col_num] = dist
            dist = dist + 1 if np.isnan(value) else 1
    return dists

This brings the execution times closer together. When I measure, my solution takes about twice as long to execute.

Upvotes: 1
