Reputation: 3729
I have a big dataset (> 200k points) and I am trying to replace zero sequences with a value. A zero sequence of more than 2 zeros is an artifact and should be removed by setting it to np.nan.
I have read Searching a sequence in a NumPy array, but it did not fully match my requirement, as I do not have a static pattern.
np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0, 0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])
# should be converted to this
np.array([0, 1.0, 0, 0, -6.0, 13.0, np.nan, np.nan, np.nan, 1.0, 16.0, np.nan, np.nan, np.nan, np.nan, 1.0, 1.0, 1.0, 1.0])
If you need some more information, let me know. Thanks in advance!
Thanks for the answers; here are my (unprofessional) test results, run on 288240 points:
divakar took 0.016000ms to replace 87912 points
desiato took 0.076000ms to replace 87912 points
polarise took 0.102000ms to replace 87912 points
As @Divakar's solution is the shortest and fastest, I accepted it.
Upvotes: 4
Views: 340
Reputation: 1152
You could use groupby from the itertools package:
import numpy as np
from itertools import groupby

l = np.array([0, 1, 0, 0, -6, 13, 0, 0, 0, 1, 16, 0, 0, 0, 0])

def _ret_list(k, it):
    # number of elements in the iterator, i.e., length of the run of equal items
    n = sum(1 for i in it)
    if k == 0 and n > 2:
        # run contains more than two zeros: replace each zero by np.nan
        return [np.nan] * n
    else:
        # return the run of equal items unchanged
        return [k] * n

# group consecutive equal items and apply _ret_list to each group
processed_l = [_ret_list(k, g) for k, g in groupby(l)]
# flatten the list of lists and convert to a NumPy array
processed_l = np.array([item for sub in processed_l for item in sub])
print(processed_l)
which gives you
[ 0. 1. 0. 0. -6. 13. nan nan nan 1. 16. nan nan nan nan]
Note that each int is converted to a float, since NaN is a floating-point value; see here: NumPy or Pandas: Keeping array type as integer while having a NaN value.
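A minimal illustration of that dtype promotion:
import numpy as np

# NaN is a float, so an array containing it is promoted to a float dtype
print(np.array([1, 2, np.nan]).dtype)  # float64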
Upvotes: 1
Reputation: 221674
Well, that's basically a binary closing operation with a threshold requirement on the closing gap. Here's an implementation based on it, using SciPy's binary_closing -
import numpy as np
from scipy.ndimage import binary_closing

# Pad with ones so as to make binary closing work around the boundaries too
a_extm = np.hstack((True, a != 0, True))
# Perform binary closing and look for the ones that have not changed, indicating
# the gaps in those cases were above the threshold requirement for closing
mask = a_extm == binary_closing(a_extm, structure=np.ones(3))
# Out of those, avoid the 1s from the original array and set the rest as NaNs
out = np.where(~a_extm[1:-1] & mask[1:-1], np.nan, a)
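For reference, a minimal sketch running this on the sample array from the question (assuming a is that array and SciPy is available):
import numpy as np
from scipy.ndimage import binary_closing

a = np.array([0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0, 0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0])
a_extm = np.hstack((True, a != 0, True))
mask = a_extm == binary_closing(a_extm, structure=np.ones(3))
out = np.where(~a_extm[1:-1] & mask[1:-1], np.nan, a)
print(out)
# [ 0.  1.  0.  0. -6. 13. nan nan nan  1. 16. nan nan nan nan  1.  1.  1.  1.]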
The earlier method needs that padding to work with the boundary elements, which might make it a bit expensive when dealing with a large dataset. One way to avoid the appending would be like so -
# Create binary closed mask
mask = ~binary_closing(a != 0, structure=np.ones(3))
# Fix up the boundary runs explicitly instead of padding
idx = np.where(a)[0]
mask[:idx[0]] = idx[0] >= 3
mask[idx[-1] + 1:] = a.size - idx[-1] - 1 >= 3
# Use the mask to set NaNs in a
out = np.where(mask, np.nan, a)
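To sanity-check the boundary handling, a quick sketch on a made-up array with leading and trailing zero runs (hypothetical example data, not from the question):
import numpy as np
from scipy.ndimage import binary_closing

a = np.array([0.0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0])  # hypothetical data
mask = ~binary_closing(a != 0, structure=np.ones(3))
idx = np.where(a)[0]
mask[:idx[0]] = idx[0] >= 3                      # leading run of 3 zeros -> NaN
mask[idx[-1] + 1:] = a.size - idx[-1] - 1 >= 3   # trailing run of 4 zeros -> NaN
out = np.where(mask, np.nan, a)
print(out)  # [nan nan nan  1.  0.  0.  2. nan nan nan nan]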
Upvotes: 3
Reputation: 2413
Here is a function you can use for your lists:
import numpy as np

def replace(a_list):
    for i in range(len(a_list) - 2):
        # print(a_list[i:i+3])  # debug output
        if (a_list[i] == 0 and a_list[i+1] == 0 and a_list[i+2] == 0) or \
           (a_list[i] is np.nan and a_list[i+1] is np.nan and a_list[i+2] == 0):
            a_list[i] = np.nan
            a_list[i+1] = np.nan
            a_list[i+2] = np.nan
    return a_list
Because the list is traversed in one direction you only need two comparisons, (0, 0, 0) or (NaN, NaN, 0), because you replace 0 with NaN as you go.
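For instance, a quick run on the sample data from the question as a plain Python list:
data = [0, 1.0, 0, 0, -6.0, 13.0, 0, 0, 0, 1.0, 16.0, 0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0]
print(replace(data))
# [0, 1.0, 0, 0, -6.0, 13.0, nan, nan, nan, 1.0, 16.0, nan, nan, nan, nan, 1.0, 1.0, 1.0, 1.0]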
Upvotes: 1