Reputation: 555
I have a dataframe like the following:
A
1 1000
2 1000
3 1001
4 1001
5 10
6 1000
7 1010
8 9
9 10
10 6
11 999
12 10110
13 10111
14 1000
I am trying to clean my dataframe as follows: drop every row whose value is more than 1.5 times, or less than 0.5 times, the previous row's value. If the previous row is itself a to-drop row, the comparison must be made with the nearest preceding row that is NOT dropped (for example, index 9, 10 or 13 in my dataframe). So the final dataframe should look like:
A
1 1000
2 1000
3 1001
4 1001
6 1000
7 1010
11 999
14 1000
My dataframe is really huge so performance is appreciated.
Upvotes: 6
Views: 2075
Reputation: 61910
One alternative is to use itertools.accumulate to carry forward the last valid value, and then keep only the rows whose original value equals the carried value, e.g.:
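from itertools import accumulate

def change(x, y, pct=0.5):
    # x is the last kept value, y is the current value
    if pct * x <= y <= (1 + pct) * x:
        return y  # y is in range, so it becomes the new reference
    return x      # otherwise keep carrying the previous reference

# create a mask filtering out the values that are different from the original A
mask = (df.A == list(accumulate(df.A, change)))
print(df[mask])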
Output
A
1 1000
2 1000
3 1001
4 1001
6 1000
7 1010
11 999
14 1000
Just to get an idea, see how the accumulated column (change) compares to the original side-by-side:
A change
1 1000 1000
2 1000 1000
3 1001 1001
4 1001 1001
5 10 1001
6 1000 1000
7 1010 1010
8 9 1010
9 10 1010
10 6 1010
11 999 999
12 10110 999
13 10111 999
14 1000 1000
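For reference, a minimal self-contained sketch that should reproduce the side-by-side view above. It rebuilds the question's data by hand; the names `carried`, `side_by_side` and `filtered` are only illustrative and not part of the original answer.

```python
from itertools import accumulate

import pandas as pd


def change(x, y, pct=0.5):
    # x is the last kept value, y is the current value
    if pct * x <= y <= (1 + pct) * x:
        return y
    return x


# Sample data copied from the question, indexed 1..14
df = pd.DataFrame(
    {"A": [1000, 1000, 1001, 1001, 10, 1000, 1010, 9, 10, 6, 999, 10110, 10111, 1000]},
    index=range(1, 15),
)

# Carry the last valid value forward through the column
carried = list(accumulate(df["A"], change))

# Original column next to the carried-forward column
side_by_side = df.assign(change=carried)

# Keep only the rows whose value was not replaced by a carried one
filtered = df[df["A"] == carried]
print(side_by_side)
print(filtered)
```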
Update
To set pct explicitly in the function call, wrap change in a lambda:
mask = (df.A == list(accumulate(df.A, lambda x, y: change(x, y, pct=0.5))))
Upvotes: 1
Reputation: 294228
I'll pass the series to a generator function that yields the index values of the rows that satisfy the condition.
def f(s):
    it = s.items()  # `iteritems` was removed in pandas 2.0; use `items`
    i, v = next(it)
    yield i  # Yield the first one
    for j, x in it:
        if .5 * v <= x <= 1.5 * v:
            yield j  # Yield the ones that satisfy
            v = x    # Update the comparative value

df.loc[list(f(df.A))]  # Use `loc` with index values
                       # yielded by my generator
A
1 1000
2 1000
3 1001
4 1001
6 1000
7 1010
11 999
14 1000
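Since the question mentions a very large dataframe, here is a hedged variant of the same idea that loops over the underlying NumPy array and builds a boolean mask instead of collecting index labels. The helper name `keep_mask` and its `low`/`high` parameters are hypothetical, and no speed-up has been measured here; it is only a sketch of the same comparison against the last kept value.

```python
import numpy as np
import pandas as pd


def keep_mask(values, low=0.5, high=1.5):
    """Return a boolean mask that is True where a value lies within
    [low, high] times the last kept value."""
    values = np.asarray(values, dtype=float)
    mask = np.zeros(len(values), dtype=bool)
    if len(values) == 0:
        return mask
    mask[0] = True      # the first row is always kept
    last = values[0]    # last kept value, used as the reference
    for k in range(1, len(values)):
        x = values[k]
        if low * last <= x <= high * last:
            mask[k] = True
            last = x    # only kept rows update the reference
    return mask


# Sample data copied from the question, indexed 1..14
df = pd.DataFrame(
    {"A": [1000, 1000, 1001, 1001, 10, 1000, 1010, 9, 10, 6, 999, 10110, 10111, 1000]},
    index=range(1, 15),
)

result = df[keep_mask(df["A"].to_numpy())]
print(result)
```

Because the mask is positional, this form does not depend on the index labels being unique, whereas selecting with `loc` and yielded labels does.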
Upvotes: 7