Hrihaan

Reputation: 285

Replacing abnormally large values with nan in a numpy array

I have the following column from a data file which I am trying to plot.

[   2.21    2.34    2.56    2.78  180.      3.32    4.57    2.89  286.
    2.46    3.76    4.89   10.13]

So, in my datasets I sometimes get a drastically sharp increase in values, like (2.78 180 3.32) and (2.89 286 2.46). I want to replace these abnormal values with np.nan. I am trying to apply a condition like this: [if x(i) > 5(x(i-1)+x(i+1)), then x(i) = np.nan], which means that whenever the i-th value of x (x being the column values) is much greater than its previous and next values, Python will replace it with np.nan so that it doesn't get plotted or considered. But I haven't been able to put that into code. Any help would be much appreciated.

import numpy as np
data=np.loadtxt('/Users/Hrihaan/Desktop/Data.txt')
x=data[:,1]
print(x)

Upvotes: 1

Views: 2043

Answers (2)

unutbu

Reputation: 880079

The condition x(i)>5(x(i-1)+x(i+1)) can be tested for i = 1,...,n-1, where n is the largest allowable index of x. A vectorized version which tests this condition for all i would be:

mask = (x[1:-1] > 5*(x[2:]+x[:-2]))

And you could then assign np.nan to those locations where the mask is True by using:

x[1:-1][mask] = np.nan

Note that x[1:-1] is a slice of x -- and that is important because slices (as opposed to arrays obtained through so-called "advanced indexing") are views of the original array, x. So modifying the view, x[1:-1], affects the original array x. Thus, assigning to x[1:-1][mask] affects not only the slice x[1:-1] but x itself.

Indexing with a boolean mask invokes advanced indexing which returns a new array (not a view). So in contrast, the assignment x[mask][1:-1] = np.nan would not work because modifying x[mask] would not affect x itself. (It also would not work for a more mundane reason -- mask is the wrong length.)
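The view-versus-copy distinction can be checked directly. Here is a minimal sketch, assuming a small made-up array (the values and the names full_mask and copied are only for illustration) and using np.shares_memory to see which indexing result is still tied to x:

```python
import numpy as np

# Made-up data with one obvious spike at index 2.
x = np.array([1.0, 2.0, 100.0, 3.0, 4.0])
mask = x[1:-1] > 5 * (x[2:] + x[:-2])   # length is len(x) - 2

# Basic slicing returns a view that shares memory with x.
print(np.shares_memory(x, x[1:-1]))     # True

# Boolean (advanced) indexing returns an independent copy.
full_mask = np.zeros(x.shape, dtype=bool)  # hypothetical full-length mask
full_mask[1:-1] = mask
copied = x[full_mask]
print(np.shares_memory(x, copied))      # False

copied[:] = np.nan          # changes only the copy
x_after_copy_write = x.copy()

x[1:-1][mask] = np.nan      # writes through the view into x
```

After the last line x[2] is nan, while x_after_copy_write still holds 100.0 at index 2, which is the behaviour described above.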


Let's give it a try:

import numpy as np
x = np.array([ 2.21, 2.34, 2.56, 2.78, 180., 3.32, 4.57, 2.89, 286., 2.46, 3.76, 4.89, 10.13])
mask = (x[1:-1] > 5*(x[2:]+x[:-2]))
# array([False, False, False,  True, False, False, False,  True, False,
#        False, False], dtype=bool)
x[1:-1][mask] = np.nan

print(x)
# array([  2.21,   2.34,   2.56,   2.78,    nan,   3.32,   4.57,   2.89,
#         nan,   2.46,   3.76,   4.89,  10.13])
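
As a sanity check, the vectorized mask can be compared against a plain loop that applies the condition from the question one index at a time. This is only a sketch; spike_mask_loop is a hypothetical helper name, not something from NumPy:

```python
import numpy as np

def spike_mask_loop(x):
    """Apply x[i] > 5*(x[i-1] + x[i+1]) literally, skipping both endpoints."""
    flags = np.zeros(len(x), dtype=bool)
    for i in range(1, len(x) - 1):
        flags[i] = x[i] > 5 * (x[i - 1] + x[i + 1])
    return flags

x = np.array([2.21, 2.34, 2.56, 2.78, 180., 3.32, 4.57,
              2.89, 286., 2.46, 3.76, 4.89, 10.13])

loop_flags = spike_mask_loop(x)
vector_flags = x[1:-1] > 5 * (x[2:] + x[:-2])

# The loop result covers every index; drop the endpoints before comparing.
same = np.array_equal(loop_flags[1:-1], vector_flags)
flagged = np.flatnonzero(loop_flags)
print(same, flagged)   # True [4 8]
```

Both versions flag indices 4 and 8 (the values 180 and 286). The last value, 10.13, is never tested because it has no right-hand neighbour.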

To better understand (x[1:-1] > 5*(x[2:]+x[:-2])) it helps to look at a simplified example:

In [57]: x = np.arange(8); x
Out[57]: array([0, 1, 2, 3, 4, 5, 6, 7])

x[2:] slices off the first two items from x:

In [58]: x[2:]
Out[58]: array([2, 3, 4, 5, 6, 7])

x[:-2] slices off the last two items from x:

In [59]: x[:-2]
Out[59]: array([0, 1, 2, 3, 4, 5])

x[1:-1] slices off the first and last items from x:

In [60]: x[1:-1]
Out[60]: array([1, 2, 3, 4, 5, 6])

NumPy arithmetic is performed element-wise. So (x[2:]+x[:-2]) computes x(i-1)+x(i+1) for i=1,...,n-1:

In [61]: (x[2:]+x[:-2])
Out[61]: array([ 2,  4,  6,  8, 10, 12])

So we have this situation:

|   i | x(i-1) | x(i+1) | x(i)   |
|-----+--------+--------+--------|
|   1 | x(0)   | x(2)   | x(1)   |
|   2 | x(1)   | x(3)   | x(2)   |
|   3 | x(2)   | x(4)   | x(3)   |
| ... |        |        |        |
| n-1 | x(n-2) | x(n)   | x(n-1) |
|-----+--------+--------+--------|
        ^        ^        ^
        |        |        |
        |        |        o--- This column is the array x[1:-1]
        |        |
        |        o------------ This column is the array x[2:]
        |
        o--------------------- This column is the array x[:-2]

Another way to look at it is: once you know the condition is for i=1,...,n-1, then x(i) obviously becomes x[1:-1] since it starts at index 1 and ends 1 index before the last possible index. Next, x(i-1) and x(i+1) can be thought of as the elements to the left and right of x(i). So we are dealing with x[1:-1] shifted by one index to the left and one index to the right. So shifting x[1:-1] by one index to the right produces x[2:] and shifting x[1:-1] by one index to the left produces x[:-2].


By the way, one of the beautiful properties of Python's half-open slice syntax is that x[a:b] has (b-a) elements (for 0 <= a <= b <= len(x)). So, writing N = len(x), x[1:-1] (which is equivalent to x[1:N-1]) has N-2 elements. Noting that there are 2 missing elements makes it easy to guess that the arrays adjacent to x[1:-1] are x[2:] and x[:-2].
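
If it helps, the pieces above could be wrapped into a small function. This is just a sketch under a few assumptions: the name mask_spikes and the factor parameter are made up for illustration, the function returns a float copy rather than modifying its input, and it leaves the two endpoints untouched. The conversion to float matters because np.nan cannot be stored in an integer array (assigning it raises a ValueError).

```python
import numpy as np

def mask_spikes(values, factor=5.0):
    """Return a float copy of values with interior spikes replaced by NaN.

    A point counts as a spike when it exceeds factor times the sum of its
    two neighbours. The first and last elements are never tested.
    """
    out = np.array(values, dtype=float)   # copy; float so NaN can be stored
    if out.size < 3:
        return out                        # no interior points to test
    center, left, right = out[1:-1], out[:-2], out[2:]   # all views of out
    spikes = center > factor * (left + right)
    center[spikes] = np.nan               # writes through the view into out
    return out

# Usage with an integer array, which could not hold NaN itself.
ints = np.arange(8)
ints[4] = 500
cleaned = mask_spikes(ints)
print(cleaned)
```

Here only index 4 is replaced, and ints itself is left unchanged.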

Upvotes: 3

AGN Gazer

Reputation: 8378

If occurrences of abnormal values are rare (abnormal == rare kind of by definition), then using integer indexing instead of the boolean indexing used in @unutbu's answer would be significantly more efficient, especially in large arrays:

import numpy as np
x = np.array([ 2.21, 2.34, 2.56, 2.78, 180., 3.32, 4.57, 2.89, 286., 2.46, 3.76, 4.89, 10.13])
xp = np.pad(x, 1, 'reflect') # to deal with boundaries
idx = np.where(x > 5*(xp[2:]+xp[:-2]))
x[idx] = np.nan
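
To see what the 'reflect' padding does at the boundaries, here is a small sketch with made-up values where the first element is itself a spike. With a pad width of 1, the padded array gets x[1] in front and x[-2] at the end, so each endpoint is effectively compared against twice its only neighbour:

```python
import numpy as np

# Made-up data: spikes at index 0 (a boundary) and index 4 (interior).
x = np.array([50.0, 2.0, 3.0, 2.5, 90.0, 3.1, 2.2])

xp = np.pad(x, 1, 'reflect')
# xp is [2.0, 50.0, 2.0, 3.0, 2.5, 90.0, 3.1, 2.2, 3.1]

neighbour_sum = xp[2:] + xp[:-2]   # same length as x, endpoints included
idx = np.where(x > 5 * neighbour_sum)[0]
print(idx)   # [0 4]

x[idx] = np.nan
```

Unlike the x[1:-1] approach, this also tests the first and last elements, so whether it fits depends on whether boundary values should be eligible for replacement.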

Upvotes: 2
