ppasler

Reputation: 3729

Replace a zero sequence with another value

I have a big dataset (> 200k points) and I am trying to replace zero sequences with another value. A zero sequence with more than 2 zeros is an artifact and should be removed by setting it to np.nan.

I have read Searching a sequence in a NumPy array, but it did not fully match my requirement, as I do not have a static pattern.

np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0, 0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])
# should be converted to this
np.array([0, 1.0, 0, 0, -6.0, 13.0, np.nan, np.nan, np.nan, 1.0, 16.0, np.nan, np.nan, np.nan, np.nan, 1.0, 1.0, 1.0, 1.0])

If you need some more information, let me know. Thanks in advance!


Results:

Thanks for the answers; here are my (unprofessional) test results, running on 288240 points:

divakar took 0.016000ms to replace 87912 points
desiato took 0.076000ms to replace 87912 points
polarise took 0.102000ms to replace 87912 points

As @Divakar's solution is the shortest and fastest, I accept it.

Upvotes: 4

Views: 340

Answers (3)

desiato

Reputation: 1152

You could use groupby from the itertools package:

import numpy as np
from itertools import groupby

l = np.array([0, 1, 0, 0, -6, 13, 0, 0, 0, 1, 16, 0, 0, 0, 0])

def _ret_list(k, it):
    # number of elements in the iterator, i.e., length of the run of equal items
    n = sum(1 for i in it)

    if k == 0 and n > 2:
        # run has more than two zeros: replace each zero by np.nan
        return [np.nan] * n
    else:
        # return the run of similar items unchanged
        return [k] * n

# group consecutive equal items and apply _ret_list to each group
processed_l = [_ret_list(k, g) for k, g in groupby(l)]
# flatten the list of lists and convert to a NumPy array
processed_l = np.array([item for sub in processed_l for item in sub])

print(processed_l)

which gives you

[  0.   1.   0.   0.  -6.  13.  nan  nan  nan   1.  16.  nan  nan  nan  nan]

Note that each int is converted to a float; see here: NumPy or Pandas: Keeping array type as integer while having a NaN value
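To make the grouping step concrete, here is a small sketch (my own illustration, not part of the answer) of how groupby collapses the input into runs of equal values:

```python
from itertools import groupby

data = [0, 1, 0, 0, -6, 13, 0, 0, 0]
# each group is a (key, iterator) pair; measure the length of each run
runs = [(k, len(list(g))) for k, g in groupby(data)]
print(runs)  # [(0, 1), (1, 1), (0, 2), (-6, 1), (13, 1), (0, 3)]
```

The answer's _ret_list then simply rewrites any (0, n) run with n > 2 as n NaNs.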

Upvotes: 1

Divakar

Reputation: 221674

Well that's basically a binary closing operation with a threshold requirement on the closing gap. Here's an implementation based on it -

import numpy as np
from scipy.ndimage import binary_closing

# Pad with ones so as to make binary closing work around the boundaries too
a_extm = np.hstack((True, a != 0, True))

# Perform binary closing and look for the ones that have not changed, indicating
# that the gaps in those cases were above the threshold requirement for closing
mask = a_extm == binary_closing(a_extm, structure=np.ones(3))

# Out of those, avoid the 1s from the original array and set the rest as NaNs
out = np.where(~a_extm[1:-1] & mask[1:-1], np.nan, a)
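As a quick sanity check (my addition, not part of the original answer, assuming SciPy is available), running this on the array from the question:

```python
import numpy as np
from scipy.ndimage import binary_closing

a = np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0,
              0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])

# pad, close, and compare as in the answer
a_extm = np.hstack((True, a != 0, True))
mask = a_extm == binary_closing(a_extm, structure=np.ones(3))
out = np.where(~a_extm[1:-1] & mask[1:-1], np.nan, a)

# NaNs land exactly on the runs of more than two zeros: indices 6-8 and 11-14
print(np.flatnonzero(np.isnan(out)))
```

The short zero runs (the lone leading zero and the pair at indices 2-3) survive untouched.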

One way to avoid the appending in the earlier method, which is needed to handle the boundary elements but might make it a bit expensive when dealing with a large dataset, would be like so -

# Create binary closed mask
mask = ~binary_closing(a != 0, structure=np.ones(3))

# Fix up the boundary regions, which binary closing does not handle on its own
idx = np.where(a)[0]
mask[:idx[0]] = idx[0] >= 3
mask[idx[-1]+1:] = a.size - idx[-1] - 1 >= 3

# Use the mask to set NaNs in a
out = np.where(mask, np.nan, a)

Upvotes: 3

polarise

Reputation: 2413

Here is a function you can use for your lists:

import numpy as np

def replace(a_list):
    for i in range(len(a_list) - 2):
        if (a_list[i] == 0 and a_list[i+1] == 0 and a_list[i+2] == 0) or \
           (a_list[i] is np.nan and a_list[i+1] is np.nan and a_list[i+2] == 0):
            a_list[i] = np.nan
            a_list[i+1] = np.nan
            a_list[i+2] = np.nan
    return a_list

Because the list is traversed in one direction, you only need two comparisons, (0, 0, 0) or (NaN, NaN, 0), because you replace 0 with NaN as you go.
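A self-contained run on the list from the question (my addition; the function is repeated so the snippet stands alone):

```python
import numpy as np

def replace(a_list):
    # slide a window of 3 over the list; extend a NaN run when the next item is 0
    for i in range(len(a_list) - 2):
        if (a_list[i] == 0 and a_list[i+1] == 0 and a_list[i+2] == 0) or \
           (a_list[i] is np.nan and a_list[i+1] is np.nan and a_list[i+2] == 0):
            a_list[i] = np.nan
            a_list[i+1] = np.nan
            a_list[i+2] = np.nan
    return a_list

data = [0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0,
        0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0]
out = replace(data)
# NaN != NaN, so this picks out the replaced positions: indices 6-8 and 11-14
print([i for i, v in enumerate(out) if v != v])
```

Note the `is np.nan` identity check only works here because the function itself inserted that exact object; on a NumPy float array it would fail, so this variant really is for plain Python lists.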

Upvotes: 1
