Reputation: 285
I have the following column from a data file which I am trying to plot.
[ 2.21 2.34 2.56 2.78 180. 3.32 4.57 2.89 286.
2.46 3.76 4.89 10.13]
So, in my datasets I sometimes have drastically sharp increases in values, like (2.78 180 3.32) and (2.89 286 2.46). I want to replace these abnormal values with np.nan. I am trying to apply a condition like [if x(i)>5(x(i-1)+x(i+1)), then x(i)=np.nan], meaning that whenever the i-th value of x (x being the column values) is much greater than its previous and next values, Python replaces it with np.nan so that it doesn't get plotted or considered. But I haven't been able to put that into code. Any help would be much appreciated.
import numpy as np
data=np.loadtxt('/Users/Hrihaan/Desktop/Data.txt')
x=data[:,1]
print(x)
Upvotes: 1
Views: 2043
Reputation: 880079
The condition x(i)>5(x(i-1)+x(i+1)) can be tested for i = 1,...,n-1, where n is the largest allowable index of x.
A vectorized version which tests this condition for all values of i would be:
mask = (x[1:-1] > 5*(x[2:]+x[:-2]))
And you could then assign np.nan to those locations where the mask is True by using:
x[1:-1][mask] = np.nan
Note that x[1:-1] is a slice of x -- and that is important because slices (as opposed to arrays obtained through so-called "advanced indexing") are views of the original array, x. So modifying the view, x[1:-1], affects the original array x. Thus, assigning to x[1:-1][mask] affects not only the slice x[1:-1] but x itself.
Indexing with a boolean mask invokes advanced indexing, which returns a new array (not a view). So in contrast, the assignment x[mask][1:-1] = np.nan would not work because modifying x[mask] would not affect x itself. (It also would not work for a more mundane reason -- mask is the wrong length.)
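The view-versus-copy distinction above can be checked directly. The following is a minimal sketch on a small made-up array (the values are illustrative, not from the question's data); np.shares_memory reports whether two arrays overlap in memory.

```python
import numpy as np

# Illustrative data (an assumption for this sketch): one obvious spike at index 2
x = np.array([1.0, 2.0, 300.0, 4.0, 5.0])
mask = x[1:-1] > 5 * (x[2:] + x[:-2])    # one entry per interior element

# Basic slicing returns a view that shares memory with x
interior = x[1:-1]
print(np.shares_memory(interior, x))

# Reading with a boolean mask returns a copy
picked = interior[mask]
print(np.shares_memory(picked, x))
picked[:] = -1.0                         # changes only the copy
print(x)                                 # x still contains 300.0

# Assigning through the slice writes into x itself
x[1:-1][mask] = np.nan
print(x)
```

Only the last assignment changes x, because the boolean mask appears on the left-hand side of an assignment applied to a view, rather than being used to read out a copy first.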
Let's give it a try:
import numpy as np
x = np.array([ 2.21, 2.34, 2.56, 2.78, 180., 3.32, 4.57, 2.89, 286., 2.46, 3.76, 4.89, 10.13])
mask = (x[1:-1] > 5*(x[2:]+x[:-2]))
# array([False, False, False, True, False, False, False, True, False,
# False, False], dtype=bool)
x[1:-1][mask] = np.nan
print(x)
# array([ 2.21, 2.34, 2.56, 2.78, nan, 3.32, 4.57, 2.89,
# nan, 2.46, 3.76, 4.89, 10.13])
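For comparison, here is a sketch of the same condition written as an explicit loop, closer to how the question phrases it. The helper name mask_spikes_loop and its factor parameter are hypothetical, introduced only for this illustration. Note that the loop reads from the untouched original values, so that a value already replaced by nan does not influence the test of its neighbours; this matches the vectorized version, which computes the whole mask before assigning anything.

```python
import numpy as np

def mask_spikes_loop(values, factor=5):
    """Hypothetical helper: loop form of x(i) > factor*(x(i-1) + x(i+1))."""
    original = np.asarray(values, dtype=float)
    result = original.copy()
    for i in range(1, len(original) - 1):
        # Compare against the original neighbours, not the partly edited result
        if original[i] > factor * (original[i - 1] + original[i + 1]):
            result[i] = np.nan
    return result

x = np.array([2.21, 2.34, 2.56, 2.78, 180., 3.32, 4.57,
              2.89, 286., 2.46, 3.76, 4.89, 10.13])

looped = mask_spikes_loop(x)

# Vectorized version from above, applied to a copy so x stays intact
vectorized = x.copy()
vectorized[1:-1][vectorized[1:-1] > 5 * (vectorized[2:] + vectorized[:-2])] = np.nan

# equal_nan=True treats nan entries in the same positions as equal
print(np.allclose(looped, vectorized, equal_nan=True))
```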
To better understand (x[1:-1] > 5*(x[2:]+x[:-2])) it helps to look at a simplified example:
In [57]: x = np.arange(8); x
Out[57]: array([0, 1, 2, 3, 4, 5, 6, 7])
x[2:] slices off the first two items from x:
In [58]: x[2:]
Out[58]: array([2, 3, 4, 5, 6, 7])
x[:-2] slices off the last two items from x:
In [59]: x[:-2]
Out[59]: array([0, 1, 2, 3, 4, 5])
x[1:-1] slices off the first and last items from x:
In [60]: x[1:-1]
Out[60]: array([1, 2, 3, 4, 5, 6])
NumPy arithmetic is performed element-wise. So (x[2:]+x[:-2]) computes x(i-1)+x(i+1) for i=1,...,n-1:
In [61]: (x[2:]+x[:-2])
Out[61]: array([ 2, 4, 6, 8, 10, 12])
So we have this situation:
| i   | x(i-1) | x(i+1) | x(i)   |
|-----+--------+--------+--------|
| 1   | x(0)   | x(2)   | x(1)   |
| 2   | x(1)   | x(3)   | x(2)   |
| 3   | x(2)   | x(4)   | x(3)   |
| ... |        |        |        |
| n-1 | x(n-2) | x(n)   | x(n-1) |
|-----+--------+--------+--------|
          ^        ^        ^
          |        |        |
          |        |        o--- This column is the array x[1:-1]
          |        |
          |        o------------ This column is the array x[2:]
          |
          o--------------------- This column is the array x[:-2]
Another way to look at it is: once you know the condition is for i=1,...,n-1, then x(i) obviously becomes x[1:-1], since it starts at index 1 and ends 1 index before the last possible index.
Next, x(i-1) and x(i+1) can be thought of as the elements to the left and right of x(i). So we are dealing with x[1:-1] shifted by one index to the left and one index to the right.
So shifting x[1:-1] by one index to the right produces x[2:], and shifting x[1:-1] by one index to the left produces x[:-2].
By the way, one of the beautiful properties of Python's half-open slice syntax is that x[a:b] has (b-a) elements (for 0 <= a <= b <= len(x)). So x[1:-1] (which is equivalent to x[1:len(x)-1]) has len(x)-2 elements. Noting that there are 2 missing elements makes it easy to guess that the arrays adjacent to x[1:-1] are x[2:] and x[:-2].
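The length and alignment claims are easy to verify. Below is a small sketch using np.arange(8), as in the simplified example above, so that each value equals its own index; the variable names center, right and left are chosen just for this illustration.

```python
import numpy as np

x = np.arange(8)          # each value equals its index
n_items = len(x)

center, right, left = x[1:-1], x[2:], x[:-2]

# All three slices have len(x) - 2 elements, so they line up element-wise
print(len(center), len(right), len(left))

# Position k in each slice corresponds to x[k+1], x[k+2] and x[k]
for k in range(n_items - 2):
    assert center[k] == x[k + 1]   # x(i)
    assert right[k] == x[k + 2]    # x(i+1)
    assert left[k] == x[k]         # x(i-1)
```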
Upvotes: 3
Reputation: 8378
If occurrences of abnormal values are rare (abnormal == rare kind of by definition), then using integer indexing instead of the boolean indexing used in @unutbu's answer would be significantly more efficient, especially in large arrays:
import numpy as np
x = np.array([ 2.21, 2.34, 2.56, 2.78, 180., 3.32, 4.57, 2.89, 286., 2.46, 3.76, 4.89, 10.13])
xp = np.pad(x, 1, 'reflect') # to deal with boundaries
idx = np.where(x > 5*(xp[2:]+xp[:-2]))
x[idx] = np.nan
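One difference from the slice-based answer is worth illustrating: because of the 'reflect' padding, the first and last elements are tested too (each against its single neighbour counted twice), whereas the interior-only mask never touches them. The sketch below uses made-up data with a spike placed at index 0 to show this; np.flatnonzero is used here in place of np.where simply to get a flat array of integer positions.

```python
import numpy as np

# Illustrative data (an assumption for this sketch): spikes at index 0 and index 4
x = np.array([500.0, 2.34, 2.56, 2.78, 180.0, 3.32, 4.57])

# Interior-only test: boolean mask applied to the slice x[1:-1]
interior = x.copy()
interior[1:-1][interior[1:-1] > 5 * (interior[2:] + interior[:-2])] = np.nan

# Padded test: 'reflect' places x[1] before x[0] and x[-2] after x[-1]
padded = x.copy()
xp = np.pad(padded, 1, mode='reflect')
idx = np.flatnonzero(padded > 5 * (xp[2:] + xp[:-2]))
padded[idx] = np.nan

print(idx)        # integer positions of the flagged values
print(interior)   # the value at index 0 is kept
print(padded)     # the value at index 0 is replaced as well
```

Which behaviour is preferable depends on whether boundary values should be eligible for removal at all; the question's condition is only defined for interior points.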
Upvotes: 2