Reputation: 5135
I would like to filter a dataframe df
:
Id Timestamp Data diff1
10856167 18675685 2010-03-01 05:58:15.520 25.0 0.0
10856168 18675686 2010-03-01 05:58:16.863 26.0 1.0
10856169 18675687 2010-03-01 05:58:18.203 30.5 4.5
10856170 18675688 2010-03-01 05:58:19.543 40.5 10.0
10856171 18675689 2010-03-01 05:58:20.877 42.0 1.5
10856172 18675690 2010-03-01 05:58:22.223 43.0 1.0
10856175 18675693 2010-03-01 05:58:41.127 42.5 -0.5
10856176 18675694 2010-03-01 05:58:42.503 42.0 -0.5
10856177 18675695 2010-03-01 05:58:49.313 42.5 0.5
10856178 18675696 2010-03-01 05:58:50.663 43.0 0.5
10856181 18675699 2010-03-01 05:59:01.443 43.5 0.5
10856182 18675700 2010-03-01 05:59:02.797 42.0 -1.5
10856183 18675701 2010-03-01 05:59:04.153 41.5 -0.5
10856184 18675702 2010-03-01 05:59:05.497 41.0 -0.5
10856185 18675703 2010-03-01 05:59:29.880 41.5 0.5
10856186 18675704 2010-03-01 05:59:31.220 42.0 0.5
10856191 18675709 2010-03-01 05:59:42.053 42.5 0.5
10856192 18675710 2010-03-01 05:59:43.407 43.0 0.5
10856193 18675711 2010-03-01 05:59:44.753 42.0 -1.0
10856218 18675736 2010-03-01 06:05:21.360 41.5 -0.5
by comparing df['Data']
of current row with the next row. If the absolute difference between the values is greater than 1, keep the next row, which becomes the new current row; otherwise, remove the next row and compare with the following row until we find a row that satisfies the condition. I tried diff()
and shift()
but they only compare adjacent rows.
So expected output:
Id Timestamp Data diff1
10856167 18675685 2010-03-01 05:58:15.520 25.0 0.0
10856169 18675687 2010-03-01 05:58:18.203 30.5 4.5
10856170 18675688 2010-03-01 05:58:19.543 40.5 10.0
10856171 18675689 2010-03-01 05:58:20.877 42.0 1.5
10856181 18675699 2010-03-01 05:59:01.443 43.5 0.5
10856182 18675700 2010-03-01 05:59:02.797 42.0 -1.5
What is the best way to do this?
Update
Tried:
from numba import njit
@njit
def f(x, lim):
total = x[0]
result = np.empty(len(x), dtype=bool)
result[0] = True
for j,i in enumerate(x[1:], 1):
if abs(total - i) <= lim:
result[j] = False
else:
total = i
result[j] = True
return result
N = 1
df1 = sample[f(sample.Data.values, N)]
print(df1)
and caught error:
---------------------------------------------------------------------------
TypingError Traceback (most recent call last)
<ipython-input-15-db6a2b8a2896> in <module>
16
17 N = 1
---> 18 df1 = sample[f(sample.Data.values, N)]
19 print(df1)
~/opt/anaconda3/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
399 e.patch_message(msg)
400
--> 401 error_rewrite(e, 'typing')
402 except errors.UnsupportedError as e:
403 # Something unsupported is present in the user code, add help info
~/opt/anaconda3/lib/python3.7/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
342 raise e
343 else:
--> 344 reraise(type(e), e, None)
345
346 argtypes = []
~/opt/anaconda3/lib/python3.7/site-packages/numba/core/utils.py in reraise(tp, value, tb)
77 value = tp()
78 if value.__traceback__ is not tb:
---> 79 raise value.with_traceback(tb)
80 raise value
81
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, C)
[1] During: typing of argument at <ipython-input-15-db6a2b8a2896> (5)
File "<ipython-input-15-db6a2b8a2896>", line 5:
def f(x, lim):
total = x[0]
^
Upvotes: 1
Views: 233
Reputation: 863166
I think numba is way how working with loops here if performance is important:
from numba import njit
@njit
def f(x, lim):
total = x[0]
result = np.empty(len(x), dtype=np.bool8)
result[0] = True
for j,i in enumerate(x[1:], 1):
if abs(total - i) <= lim:
result[j] = False
else:
total = i
result[j] = True
return result
N = 1
df1 = sample[f(sample.Data.values, N)]
print(df1)
Id Timestamp Data diff1
10856167 18675685 2010-03-01 05:58:15.520 25.0 0.0
10856169 18675687 2010-03-01 05:58:18.203 30.5 4.5
10856170 18675688 2010-03-01 05:58:19.543 40.5 10.0
10856171 18675689 2010-03-01 05:58:20.877 42.0 1.5
10856181 18675699 2010-03-01 05:59:01.443 43.5 0.5
10856182 18675700 2010-03-01 05:59:02.797 42.0 -1.5
Upvotes: 1
Reputation: 16683
i = 0
for row in range(2, len(df)):
i += 1
if i <= len(df) - 1:
if -1 <= df.iloc[i,:]['Data'] - df.iloc[i-1,:]['Data'] <= 1:
df.iloc[i,3] = ''
df.loc[df['diff1'] == '',:] = ''
Output:
Id Timestamp Data diff1
10856167 18675685 2010-03-01 05:58:15.520 25 0
10856168
10856169 18675687 2010-03-01 05:58:18.203 30.5 4.5
10856170 18675688 2010-03-01 05:58:19.543 40.5 10
10856171 18675689 2010-03-01 05:58:20.877 42 1.5
10856172
10856175
10856176
10856177
10856178
10856181
10856182 18675700 2010-03-01 05:59:02.797 42 -1.5
10856183
10856184
10856185
10856186
10856191
10856192
10856193
10856218 18675736 2010-03-01 06:05:21.360 41.5 -0.5
Upvotes: 0