Compare with next row repeatedly until criteria met in Pandas

Question

I would like to filter a dataframe df:

            Id          Timestamp               Data    diff1
10856167    18675685    2010-03-01 05:58:15.520 25.0    0.0
10856168    18675686    2010-03-01 05:58:16.863 26.0    1.0
10856169    18675687    2010-03-01 05:58:18.203 30.5    4.5
10856170    18675688    2010-03-01 05:58:19.543 40.5    10.0
10856171    18675689    2010-03-01 05:58:20.877 42.0    1.5
10856172    18675690    2010-03-01 05:58:22.223 43.0    1.0
10856175    18675693    2010-03-01 05:58:41.127 42.5    -0.5
10856176    18675694    2010-03-01 05:58:42.503 42.0    -0.5
10856177    18675695    2010-03-01 05:58:49.313 42.5    0.5
10856178    18675696    2010-03-01 05:58:50.663 43.0    0.5
10856181    18675699    2010-03-01 05:59:01.443 43.5    0.5
10856182    18675700    2010-03-01 05:59:02.797 42.0    -1.5
10856183    18675701    2010-03-01 05:59:04.153 41.5    -0.5
10856184    18675702    2010-03-01 05:59:05.497 41.0    -0.5
10856185    18675703    2010-03-01 05:59:29.880 41.5    0.5
10856186    18675704    2010-03-01 05:59:31.220 42.0    0.5
10856191    18675709    2010-03-01 05:59:42.053 42.5    0.5
10856192    18675710    2010-03-01 05:59:43.407 43.0    0.5
10856193    18675711    2010-03-01 05:59:44.753 42.0    -1.0
10856218    18675736    2010-03-01 06:05:21.360 41.5    -0.5

by comparing df['Data'] of current row with the next row. If the absolute difference between the values is greater than 1, keep the next row, which becomes the new current row; otherwise, remove the next row and compare with the following row until we find a row that satisfies the condition. I tried diff() and shift() but they only compare adjacent rows.

So expected output:


            Id          Timestamp               Data    diff1
10856167    18675685    2010-03-01 05:58:15.520 25.0    0.0

10856169    18675687    2010-03-01 05:58:18.203 30.5    4.5
10856170    18675688    2010-03-01 05:58:19.543 40.5    10.0
10856171    18675689    2010-03-01 05:58:20.877 42.0    1.5





10856181    18675699    2010-03-01 05:59:01.443 43.5    0.5
10856182    18675700    2010-03-01 05:59:02.797 42.0    -1.5

What is the best way to do this?

Update

Tried:

from numba import njit
@njit
def f(x, lim):
    total = x[0]
    result = np.empty(len(x), dtype=bool)
    result[0] = True
    for j,i in enumerate(x[1:], 1):
        if abs(total - i) <= lim:
            result[j] = False
        else:
            total = i
            result[j] = True

    return result

N = 1
df1 = sample[f(sample.Data.values, N)]
print(df1)

and caught error:

---------------------------------------------------------------------------
TypingError                               Traceback (most recent call last)
 in 
     16 
     17 N = 1
---> 18 df1 = sample[f(sample.Data.values, N)]
     19 print(df1)

~/opt/anaconda3/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    399                 e.patch_message(msg)
    400 
--> 401             error_rewrite(e, 'typing')
    402         except errors.UnsupportedError as e:
    403             # Something unsupported is present in the user code, add help info

~/opt/anaconda3/lib/python3.7/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
    342                 raise e
    343             else:
--> 344                 reraise(type(e), e, None)
    345 
    346         argtypes = []

~/opt/anaconda3/lib/python3.7/site-packages/numba/core/utils.py in reraise(tp, value, tb)
     77         value = tp()
     78     if value.__traceback__ is not tb:
---> 79         raise value.with_traceback(tb)
     80     raise value
     81 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, C)
[1] During: typing of argument at  (5)

File "", line 5:
def f(x, lim):
    total = x[0]
    ^

jezrael · Accepted Answer

I think numba is way how working with loops here if performance is important:

from numba import njit
@njit
def f(x, lim):
    total = x[0]
    result = np.empty(len(x), dtype=np.bool8)
    result[0] = True
    for j,i in enumerate(x[1:], 1):
        if abs(total - i) <= lim:
            result[j] = False
        else:
            total = i
            result[j] = True

    return result

N = 1
df1 = sample[f(sample.Data.values, N)]
print(df1)
                Id                Timestamp  Data  diff1
10856167  18675685  2010-03-01 05:58:15.520  25.0    0.0
10856169  18675687  2010-03-01 05:58:18.203  30.5    4.5
10856170  18675688  2010-03-01 05:58:19.543  40.5   10.0
10856171  18675689  2010-03-01 05:58:20.877  42.0    1.5
10856181  18675699  2010-03-01 05:59:01.443  43.5    0.5
10856182  18675700  2010-03-01 05:59:02.797  42.0   -1.5

Compare with next row repeatedly until criteria met in Pandas

Answers (2)

Related Questions