Replace zeros in an array with a continuous sequence of integers

I have an array that contains NaN values or zeros as shown below. I would like to go through the array and replace every 0 with an integer, in an increasing sequence. I.e., the first zero becomes "1", the next zero becomes "2", then "3", etc.

Input:

arrayOfZeros = 

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [ 0., nan, nan, nan, nan],
       [ 0., nan,  0., nan,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [nan,  0.,  0.,  0.,  0.],
       [nan,  0., nan, nan, nan],
       [nan, nan,  0., nan, nan],
       [ 0., nan,  0., nan,  0.],
       [ 0., nan,  0., nan,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [nan, nan,  0.,  0.,  0.],
       [nan, nan, nan, nan,  0.],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

The desired output:

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [ 1., nan, nan, nan, nan],
       [ 2., nan, 19., nan, 39.],
       [ 3., 11., 20., 31., 40.],
       [ 4., 12., 21., 32., 41.],
       [nan, 13., 22., 33., 42.],
       [nan, 14., nan, nan, nan],
       [nan, nan, 23., nan, nan],
       [ 5., nan, 24., nan, 43.],
       [ 6., nan, 25., nan, 44.],
       [ 7., 15., 26., 34., 45.],
       [ 8., 16., 27., 35., 46.],
       [ 9., 17., 28., 36., 47.],
       [10., 18., 29., 37., 48.],
       [nan, nan, 30., 38., 49.],
       [nan, nan, nan, nan, 50.],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

Currently, I can almost do exactly what I want with the following code:

    with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y:
        preference = 1
        for x in y:
            if x == 0:
                x[...] = preference
                preference += 1

However, if I run this code outside of the Python Console, I get the following error message:

TypeError: Iterator operand or requested dtype holds references, but the REFS_OK flag was not enabled

Is there another way to accomplish this in NumPy?

Upvotes: 2

Views: 359

Answers (6)

hpaulj

Reputation: 231335

Why did you use nditer? You got it working, which isn't a trivial task, but you seem to have missed the message that it isn't a speed tool, at least not when used from Python code. Plain iteration is usually just as good, unless you are doing some fancy broadcasting. And as the other answers show, a non-iterative approach is even better.
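For comparison, plain iteration over indices might look like the sketch below. The helper name number_zeros_plain and the small test array are made up for illustration. The outer loop runs over columns so that the numbering goes down each column, as in the desired output.

```python
import numpy as np

def number_zeros_plain(a):
    """Return a copy of `a` with each 0 replaced by 1, 2, 3, ... in column order.

    Hypothetical helper for illustration; not part of NumPy.
    """
    out = a.copy()
    counter = 1
    rows, cols = out.shape
    for j in range(cols):          # outer loop over columns ...
        for i in range(rows):      # ... inner loop down each column
            if out[i, j] == 0:     # NaN == 0 is False, so NaNs are skipped
                out[i, j] = counter
                counter += 1
    return out

# Made-up 3x2 example
small = np.array([[np.nan, 0.0],
                  [0.0,    0.0],
                  [0.0, np.nan]])
numbered = number_zeros_plain(small)
```

The input array is left untouched because the function works on a copy.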

But let's focus on nditer:

https://numpy.org/devdocs/reference/arrays.nditer.html

Recreate your array:

In [1]: nan=np.nan                                                                     
In [2]: arr = np.array([[nan, nan, nan, nan, nan], 
   ...:        [nan, nan, nan, nan, nan], 
   ...:        [ 0., nan, nan, nan, nan], 
   ...:        [ 0., nan,  0., nan,  0.], 
   ...:        [ 0.,  0.,  0.,  0.,  0.], 
   ...:        [ 0.,  0.,  0.,  0.,  0.], 
   ...:        [nan,  0.,  0.,  0.,  0.], 
   ...:        [nan,  0., nan, nan, nan], 
...

In [3]: arrayOfZeros = arr.copy()                                                      
In [4]: arr.dtype                                                                      
Out[4]: dtype('float64')
In [5]: with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y: 
   ...:         preference = 1 
   ...:         for x in y: 
   ...:             if x == 0: 
   ...:                 x[...] = preference 
   ...:                 preference += 1 
   ...:                                                                                
In [6]: arrayOfZeros                                                                   
Out[6]: 
array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [ 1., nan, nan, nan, nan],
       [ 2., nan,  3., nan,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [nan, 15., 16., 17., 18.],
       [nan, 19., nan, nan, nan],
...

OK, it works, but the numbers run along the rows, whereas your desired output runs down the columns. That column-wise layout is what forces all the other answers to do contortions with transpose.

If I change the dtype of the array to object I get your error:

In [7]: arrayOfZeros = arr.astype(object)                                              
In [8]: with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y: 
   ...:         preference = 1 
   ...:         for x in y: 
   ...:             if x == 0: 
   ...:                 x[...] = preference 
   ...:                 preference += 1 
   ...:                                                                                
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-7dd225a24a36> in <module>
----> 1 with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y:
      2         preference = 1
      3         for x in y:
      4             if x == 0:
      5                 x[...] = preference

TypeError: Iterator operand or requested dtype holds references, but the REFS_OK flag was not enabled

Making the suggested fix: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nditer.html

In [10]: with np.nditer(arrayOfZeros, flags=['refs_ok'], op_flags=['readwrite']) as y: 
    ...:         preference = 1 
    ...:         for x in y: 
    ...:             if x == 0: 
    ...:                 x[...] = preference 
    ...:                 preference += 1 
    ...:                                                                               
In [11]: arrayOfZeros                                                                  
Out[11]: 
array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [1, nan, nan, nan, nan],
       [2, nan, 3, nan, 4],
       [5, 6, 7, 8, 9],
       [10, 11, 12, 13, 14],
       [nan, 15, 16, 17, 18],
       [nan, 19, nan, nan, nan],

It doesn't display in neat columns because of the object dtype.
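If the object dtype is not actually needed, another option is to cast the array back to float64 before iterating, so that refs_ok is not required. A minimal sketch, assuming every element is a float, an int or NaN (the small array obj is made up for illustration):

```python
import numpy as np

# Made-up object-dtype array standing in for the real data.
obj = np.array([[np.nan, 0.0],
                [0.0,    0.0]], dtype=object)

# Cast to float64; this only works if every element converts to a float.
as_float = obj.astype(np.float64)

# Same loop as in the question, now on a plain float array.
with np.nditer(as_float, op_flags=['readwrite']) as it:
    preference = 1
    for x in it:
        if x == 0:
            x[...] = preference
            preference += 1
```

The result is a regular float array again, so it also prints in aligned columns.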

If I change the array to order='F', we get the consecutive numbers going down the columns:

In [12]: arrayOfZeros = arr.copy(order='F') 
In [14]: with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y: 
    ...:         preference = 1 
    ...:         for x in y: 
    ...:             if x == 0: 
    ...:                 x[...] = preference 
    ...:                 preference += 1 
    ...:                                                                               
In [15]: arrayOfZeros                                                                  
Out[15]: 
array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [ 1., nan, nan, nan, nan],
       [ 2., nan, 19., nan, 39.],
       [ 3., 11., 20., 31., 40.],
       [ 4., 12., 21., 32., 41.],
       [nan, 13., 22., 33., 42.],
       [nan, 14., nan, nan, nan],
....

The order 'F' and the object dtype make me wonder: is the source of this array a pandas DataFrame?
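A related sketch: nditer also accepts an order argument, so column-wise iteration can be requested directly instead of copying the array into 'F' layout first. The small array below is made up for illustration.

```python
import numpy as np

# Made-up C-ordered example array
a = np.array([[0.0, np.nan, 0.0],
              [0.0, 0.0, np.nan]])

# order='F' asks the iterator to visit elements column by column,
# regardless of the memory layout of `a`.
with np.nditer(a, op_flags=['readwrite'], order='F') as it:
    preference = 1
    for x in it:
        if x == 0:
            x[...] = preference
            preference += 1
```

With the default order='K' the iterator follows the memory layout, which is why copying with order='F' changes the numbering in the session above.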

Upvotes: 0

Paul Panzer

Reputation: 53029

Why is everybody insisting on using the cumsum here? It's wasteful. Better:

out = arrayOfZeros.copy()
z = out==out    # True wherever out is not NaN, since NaN != NaN
out.T[z.T] = np.arange(1,1+np.count_nonzero(z))    # number those cells column by column
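To see why this works: out == out is False exactly at the NaN entries, because NaN compares unequal to itself, so z marks every non-NaN cell. Indexing the transposed view out.T with z.T visits those cells column by column of the original array, and the assignment writes through to out. A small made-up example:

```python
import numpy as np

# Made-up 3x2 example array
out = np.array([[0.0, np.nan],
                [0.0, 0.0],
                [np.nan, 0.0]])

z = out == out                   # same mask as ~np.isnan(out)
# out.T is a view, so assigning through it modifies `out` itself.
out.T[z.T] = np.arange(1, 1 + np.count_nonzero(z))
```

Note that this numbers every non-NaN entry. That matches the question only because all non-NaN entries are 0; with other values present, out == 0 would be the mask to use.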

Timings (microseconds per call: the total time for 10000 calls in seconds, times 100):

5.025142431259155   # PP
38.67108239792287   # cumsum 1   rafaelc
9.263199986889958   # cumsum 2   Derek Eden
9.044178808107972   # cumsum 3   Onyambu
10.640528565272689  # cumsum 4   Andy L.

Code:

import numpy as np

array,nan = np.array,np.nan

x = \
array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [ 0., nan, nan, nan, nan],
       [ 0., nan,  0., nan,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [nan,  0.,  0.,  0.,  0.],
       [nan,  0., nan, nan, nan],
       [nan, nan,  0., nan, nan],
       [ 0., nan,  0., nan,  0.],
       [ 0., nan,  0., nan,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [nan, nan,  0.,  0.,  0.],
       [nan, nan, nan, nan,  0.],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

from timeit import timeit

def f_pp():
    out = x.copy()
    z = out==out
    out.T[z.T] = np.arange(1,1+np.count_nonzero(z))
    return out

def f_cumsum():
    arr = x.copy()
    mask = ~np.isnan(arr)
    arr[mask] = np.nan_to_num(arr + 1).ravel('F').cumsum().reshape(arr.shape, order='F')[mask]
    return arr

def f_cumsum_2():
    arr = x.copy()
    in_arr = arr.T
    fill = (in_arr==0).cumsum().reshape(in_arr.shape)
    return (in_arr + fill).T

def f_cumsum_3():
    arrayOfZeros = x.copy()
    mask = arrayOfZeros==0
    arrayOfZeros.T[mask.T] = mask.T.cumsum()[mask.T.flatten()]
    return arrayOfZeros

def f_cumsum_4():
    arrayOfZeros = x.copy()
    m = (arrayOfZeros == 0)
    a = (arrayOfZeros.T == 0).cumsum().reshape(-1, arrayOfZeros.shape[0]).T
    arrayOfZeros[m] = a[m]
    return arrayOfZeros

assert(np.nan_to_num(f_pp()) == np.nan_to_num(f_cumsum())).all()
assert(np.nan_to_num(f_pp()) == np.nan_to_num(f_cumsum_2())).all()
assert(np.nan_to_num(f_pp()) == np.nan_to_num(f_cumsum_3())).all()
assert(np.nan_to_num(f_pp()) == np.nan_to_num(f_cumsum_4())).all()

for f in (f_pp,f_cumsum,f_cumsum_2,f_cumsum_3,f_cumsum_4):
    print(timeit(f,number=10000)*100)
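The nan_to_num calls in the assertions are there because NaN != NaN would otherwise make the element-wise comparison fail. An alternative sketch uses np.testing.assert_array_equal, which treats NaNs in matching positions as equal. The arrays a and b are made up for illustration.

```python
import numpy as np

a = np.array([[1.0, np.nan], [2.0, 3.0]])
b = a.copy()

# Plain == is False wherever both sides are NaN ...
plain_equal = bool((a == b).all())

# ... while assert_array_equal accepts NaNs in matching positions
# and raises AssertionError only on a real mismatch.
np.testing.assert_array_equal(a, b)
```

Unlike the nan_to_num comparison, this also catches the case where one result has NaN and the other has 0 in the same position.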

Upvotes: 2

Andy L.

Reputation: 25239

Create a boolean mask m that is True where the array is 0. Use transpose, cumsum and reshape to build an array a holding the running count of zeros in column order. Finally, assign through the mask m.

m = (arrayOfZeros == 0)
a = (arrayOfZeros.T == 0).cumsum().reshape(-1, arrayOfZeros.shape[0]).T
arrayOfZeros[m] = a[m]

Out[353]:
array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [ 1., nan, nan, nan, nan],
       [ 2., nan, 19., nan, 39.],
       [ 3., 11., 20., 31., 40.],
       [ 4., 12., 21., 32., 41.],
       [nan, 13., 22., 33., 42.],
       [nan, 14., nan, nan, nan],
       [nan, nan, 23., nan, nan],
       [ 5., nan, 24., nan, 43.],
       [ 6., nan, 25., nan, 44.],
       [ 7., 15., 26., 34., 45.],
       [ 8., 16., 27., 35., 46.],
       [ 9., 17., 28., 36., 47.],
       [10., 18., 29., 37., 48.],
       [nan, nan, 30., 38., 49.],
       [nan, nan, nan, nan, 50.],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])
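The transpose, cumsum and reshape steps can be traced on a small made-up array. The intermediate name counts is introduced here only for illustration.

```python
import numpy as np

# Made-up 3x2 example array
arrayOfZeros = np.array([[0.0, np.nan],
                         [0.0, 0.0],
                         [np.nan, 0.0]])

m = (arrayOfZeros == 0)
# Running count of zeros, taken over the transposed (column-first) array.
counts = (arrayOfZeros.T == 0).cumsum()
# Back to the original shape: reshape to (n_cols, n_rows), then transpose.
a = counts.reshape(-1, arrayOfZeros.shape[0]).T
# Only the positions that held a zero receive their running count.
arrayOfZeros[m] = a[m]
```

The entries of a at NaN positions hold the count so far, but they are never used because the assignment goes through m.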

Upvotes: 0

Onyambu

Reputation: 79188

mask = arrayOfZeros==0
arrayOfZeros.T[mask.T] = mask.T.cumsum()[mask.T.flatten()]

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [ 1., nan, nan, nan, nan],
       [ 2., nan, 19., nan, 39.],
       [ 3., 11., 20., 31., 40.],
       [ 4., 12., 21., 32., 41.],
       [nan, 13., 22., 33., 42.],
       [nan, 14., nan, nan, nan],
       [nan, nan, 23., nan, nan],.....

Upvotes: 0

Derek Eden

Reputation: 4618

You could also take this approach:

arr #just for example

array([[ 0., nan,  0., nan, nan,  0.,  0.],
       [ 0.,  0.,  0., nan, nan, nan,  0.]])

in_arr = arr.T
fill = (in_arr==0).cumsum().reshape(in_arr.shape)
out_arr = (in_arr + fill).T

output:

array([[ 1., nan,  4., nan, nan,  6.,  7.],
       [ 2.,  3.,  5., nan, nan, nan,  8.]])
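The addition works because every non-NaN entry is 0: 0 plus the count gives the count, and NaN plus anything stays NaN. If the array could also hold other values, which is an assumption beyond the question, a variant with np.where would leave them untouched. The example array below is made up.

```python
import numpy as np

# Made-up array with one non-zero, non-NaN value (7.5)
arr = np.array([[0.0, np.nan, 7.5],
                [0.0, 0.0, np.nan]])

in_arr = arr.T
fill = (in_arr == 0).cumsum().reshape(in_arr.shape)
# Take the running count where the entry is 0, keep the original value elsewhere.
out_arr = np.where(in_arr == 0, fill, in_arr).T
```

Like the original snippet, this returns a new array and leaves arr unchanged.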

Upvotes: 0

rafaelc

Reputation: 59264

Use broadcasting. Save the mask of non-NaN entries with isnan, then use ravel() with 'F' ordering plus cumsum to number the entries column by column in a vectorized way.

mask = ~np.isnan(arr)
arr[mask] = np.nan_to_num(arr + 1).ravel('F').cumsum().reshape(arr.shape, order='F')[mask]

Since you tagged pandas: if you have a DataFrame, you can call cumsum directly, since it skips NaN.

pd.DataFrame(arr.ravel('F')).add(1).cumsum().to_numpy().reshape(arr.shape, order='F')
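The NumPy one-liner above can be unpacked step by step. The intermediate names ones, running and numbered, and the small array, are made up for illustration.

```python
import numpy as np

# Made-up 2x2 example array
arr = np.array([[0.0, np.nan],
                [0.0, 0.0]])

mask = ~np.isnan(arr)
ones = np.nan_to_num(arr + 1)          # 1 at each zero, 0 at each NaN
running = ones.ravel('F').cumsum()     # running count down the columns
numbered = running.reshape(arr.shape, order='F')
arr[mask] = numbered[mask]             # NaN positions are left as they were
```

Reshaping with order='F' puts each running count back at the position it was raveled from.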

Upvotes: 2
