Reputation: 174
I'm stuck at something that I think could easily be solved in a couple of lines using Numpy, I just don't see it. Let's define an example array containing some missing values:
import numpy as np
input_data = np.array([[1,3,5,8,6],[3,np.nan,np.nan,5,6],[np.nan,6,7,np.nan,2]])
Out[530]: 
array([[  1.,   3.,   5.,   8.,   6.],
       [  3.,  nan,  nan,   5.,   6.],
       [ nan,   6.,   7.,  nan,   2.]])
What I'm looking for is to get an array that gives me for each element the distance to the previous valid value in each row. In the example above, this would be something like:
delta_valid = [[nan, 1, 1, 1, 1], [nan, 1, 2, 3, 1], [nan, nan, 1, 1, 2]]
The first element in each row would always be NaN because there is no previous value (not sure if there's a better way to define this).
Can anyone help me get this result in NumPy? Thank you very much!
Upvotes: 2
Views: 346
Reputation: 221624
You are basically making ranges of (1, 2, 3, ...) that run until the next non-NaN. To solve such cases, we could use some diff + cumsum magic on each row, as shown below -
def closest_distance_per_row(a):
    m0 = np.ones(a.shape, dtype=int)
    mask = ~np.isnan(a)
    for i, item in enumerate(a):
        idx = np.flatnonzero(mask[i])
        if len(idx) > 0:
            m0[i, :idx[0]] = 0
            m0[i, idx[1:]] = idx[:-1] - idx[1:] + 1
    out = np.full(a.shape, np.nan, dtype=float)
    out[:, 1:] = m0[:, :-1].cumsum(1)
    out[out == 0] = np.nan
    out[~mask.any(1)] = np.nan
    return out
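To make the diff + cumsum step concrete, here is a small trace of the intermediate array for the second sample row (a sketch; the single-row setup is just for illustration):

```python
import numpy as np

row = np.array([3, np.nan, np.nan, 5, 6])
mask = ~np.isnan(row)
idx = np.flatnonzero(mask)            # indices of valid values: [0, 3, 4]

# Start with steps of +1 everywhere, then write a "reset" at each valid
# position so the cumulative sum snaps back to 1 right after it.
m0 = np.ones(row.shape, dtype=int)
m0[idx[1:]] = idx[:-1] - idx[1:] + 1  # m0 is now [1, 1, 1, -2, 0]

# Shift right by one and take the cumulative sum: runs of 1s count up
# from the last valid value.
out = np.full(row.shape, np.nan)
out[1:] = m0[:-1].cumsum()            # out -> [nan, 1., 2., 3., 1.]
```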
Sample runs -
In [353]: a
Out[353]:
array([[ 1., 3., 5., 8., 6.],
[ 3., nan, nan, 5., 6.],
[ nan, 6., 7., nan, 2.]])
In [354]: closest_distance_per_row(a)
Out[354]:
array([[ nan, 1., 1., 1., 1.],
[ nan, 1., 2., 3., 1.],
[ nan, nan, 1., 1., 2.]])
In [343]: a
Out[343]:
array([[ nan, nan, nan, nan, nan, nan, 4., nan, 3., 1.],
[ nan, nan, 6., nan, nan, nan, nan, nan, nan, nan],
[ 0., nan, 2., nan, 1., nan, 0., nan, nan, nan],
[ 3., nan, 2., nan, 8., 6., nan, 4., 2., nan],
[ nan, 0., nan, nan, nan, nan, nan, nan, nan, nan],
[ nan, nan, 2., nan, 0., nan, nan, 1., nan, nan]])
In [344]: closest_distance_per_row(a)
Out[344]:
array([[ nan, nan, nan, nan, nan, nan, nan, 1., 2., 1.],
[ nan, nan, nan, 1., 2., 3., 4., 5., 6., 7.],
[ nan, 1., 2., 1., 2., 1., 2., 1., 2., 3.],
[ nan, 1., 2., 1., 2., 1., 1., 2., 1., 1.],
[ nan, nan, 1., 2., 3., 4., 5., 6., 7., 8.],
[ nan, nan, nan, 1., 2., 1., 2., 3., 1., 2.]])
Runtime test -
In [4]: a = np.random.randint(0,9,(5000,5000)).astype(float)
In [5]: a.ravel()[np.random.choice(a.size, int(a.size*0.5), replace=0)] = np.nan
In [6]: %timeit two_loops(a)
1 loops, best of 3: 16.7 s per loop
In [7]: %timeit closest_distance_per_row(a)
1 loops, best of 3: 339 ms per loop
In [8]: 16700/339.0 # Speedup with one loop (proposed in this post) over two loops
Out[8]: 49.26253687315634
Upvotes: 1
Reputation: 6891
Here is a solution to your problem. It might not be optimal, as it might be possible to do something more fancy with map and/or list comprehensions, but at least it solves your immediate issue:
import numpy as np

input_data = np.array([[1, 3, 5, 8, 6],
                       [3, np.nan, np.nan, 5, 6],
                       [np.nan, 6, 7, np.nan, 2]])

def distance(vector):
    dist = np.nan
    dists = []
    for a in vector:
        dists.append(dist)
        dist = dist + 1 if np.isnan(a) else 1
    return np.array(dists)

dists = np.empty(input_data.shape)
for row_num, row in enumerate(input_data):
    dists[row_num, :] = distance(row)
It also only works for 2d arrays currently, but it could probably be generalized pretty easily.
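For instance, one way to generalize could be np.apply_along_axis over the last axis (a sketch; the 3-D input here is made up for illustration):

```python
import numpy as np

def distance(vector):
    # Distance to the previous valid value; NaN propagates through
    # leading NaNs since nan + 1 is still nan.
    dist = np.nan
    dists = []
    for a in vector:
        dists.append(dist)
        dist = dist + 1 if np.isnan(a) else 1
    return np.array(dists)

arr3d = np.array([[[1, np.nan, 2],
                   [np.nan, 3, np.nan]]])

# Apply the 1-D helper along the last axis of an arbitrary-rank array.
dists = np.apply_along_axis(distance, -1, arr3d)
```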
Also, the above piece of code is not very optimized. To make a fairer comparison with the accepted answer, here is a more optimized version with no extra function calls or list builds:
def two_loops(input_data):
    dists = np.empty(input_data.shape)
    for row_num, row in enumerate(input_data):
        dist = np.nan
        for col_num, value in enumerate(row):
            dists[row_num, col_num] = dist
            dist = dist + 1 if np.isnan(value) else 1
    return dists
This makes the execution times more similar. When I measure, my solution takes about twice as long to execute.
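As a quick sanity check, running two_loops on the question's input_data reproduces the expected distances (compared with equal_nan=True, since nan != nan):

```python
import numpy as np

def two_loops(input_data):
    dists = np.empty(input_data.shape)
    for row_num, row in enumerate(input_data):
        dist = np.nan
        for col_num, value in enumerate(row):
            dists[row_num, col_num] = dist
            dist = dist + 1 if np.isnan(value) else 1
    return dists

input_data = np.array([[1, 3, 5, 8, 6],
                       [3, np.nan, np.nan, 5, 6],
                       [np.nan, 6, 7, np.nan, 2]])

expected = np.array([[np.nan, 1, 1, 1, 1],
                     [np.nan, 1, 2, 3, 1],
                     [np.nan, np.nan, 1, 1, 2]])

result = two_loops(input_data)
assert np.allclose(result, expected, equal_nan=True)
```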
Upvotes: 1