MarP

Reputation: 43

Extract groups of consecutive values longer than a specified size

I am trying to find, within a DataFrame, whether there are at least X consecutive operations (I have already included a column "FILTER_OK" that indicates whether a row meets the criteria), and to extract that group of rows.

    TRN             TRN_DATE  FILTER_OK
0  5153  04/04/2017 11:40:00       True
1  7542  04/04/2017 17:18:00       True
2   875  04/04/2017 20:08:00       True
3    74  05/04/2017 20:30:00      False
4  9652  06/04/2017 20:32:00       True
5   965  07/04/2017 12:52:00       True
6   752  10/04/2017 17:40:00       True
7  9541  10/04/2017 19:29:00       True
8  7452  11/04/2017 12:20:00       True
9  9651  12/04/2017 13:57:00      False

For this example, suppose I am looking for at least 4 consecutive operations.
OUTPUT DESIRED:

    TRN             TRN_DATE  FILTER_OK
4  9652  06/04/2017 20:32:00       True
5   965  07/04/2017 12:52:00       True
6   752  10/04/2017 17:40:00       True
7  9541  10/04/2017 19:29:00       True
8  7452  11/04/2017 12:20:00       True

How can I subset the operations I need?

Upvotes: 2

Views: 152

Answers (4)

MarP

Reputation: 43

This is actually part of a group-by operation (by the CRD column). If there are two consecutive groups of rows (CRD 111 and 333) and the second group does not meet the condition (fewer than 4 consecutive True rows), the first row of that group (the bold line below) is included when it shouldn't be:

     CRD   TRN             TRN_DATE  FILTER_OK
0    111  5153  04/04/2017 11:40:00       True
1    111  7542  04/04/2017 17:18:00       True
2    256   875  04/04/2017 20:08:00       True
3    365    74  05/04/2017 20:30:00      False
4    111  9652  06/04/2017 20:32:00       True
5    111   965  07/04/2017 12:52:00       True
6    111   752  10/04/2017 17:40:00       True
7    111  9541  10/04/2017 19:29:00       True
**8  333  7452  11/04/2017 12:20:00       True**
9    333  9651  12/04/2017 13:57:00      False
10   333   961  12/04/2017 13:57:00      False
11   333   871  12/04/2017 13:57:00      False

Actual output:

     CRD   TRN             TRN_DATE  FILTER_OK
4    111  9652  06/04/2017 20:32:00       True
5    111   965  07/04/2017 12:52:00       True
6    111   752  10/04/2017 17:40:00       True
7    111  9541  10/04/2017 19:29:00       True
**8  333  7452  11/04/2017 12:20:00       True**

Desired output:

    CRD   TRN             TRN_DATE  FILTER_OK
4   111  9652  06/04/2017 20:32:00       True
5   111   965  07/04/2017 12:52:00       True
6   111   752  10/04/2017 17:40:00       True
7   111  9541  10/04/2017 19:29:00       True
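One way to fold the CRD grouping into the run detection is to start a new run whenever either FILTER_OK flips or CRD changes. This is a sketch, not a definitive answer; the data is taken from the table above, with TRN_DATE omitted for brevity:

```python
import pandas as pd

df = pd.DataFrame({
    'CRD': [111, 111, 256, 365, 111, 111, 111, 111, 333, 333, 333, 333],
    'TRN': [5153, 7542, 875, 74, 9652, 965, 752, 9541, 7452, 9651, 961, 871],
    'FILTER_OK': [True, True, True, False, True, True, True, True,
                  True, False, False, False],
})

# Start a new run whenever FILTER_OK flips OR the CRD value changes
run = (df['FILTER_OK'].ne(df['FILTER_OK'].shift())
       | df['CRD'].ne(df['CRD'].shift())).cumsum()

# Keep only True runs of at least 4 consecutive rows
result = df[df['FILTER_OK'] & run.groupby(run).transform('size').ge(4)]
```

Because row 8 (CRD 333) starts a new run, it is no longer merged with the CRD 111 run, and only rows 4-7 survive.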

Upvotes: 0

Valdi_Bo

Reputation: 30971

One possible option is to use itertools.groupby, called on df.values.

An important difference of this method, compared to pandas groupby, is that a new group is created whenever the grouping key changes, so only consecutive rows with the same key end up in the same group.

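This consecutive-grouping behavior can be seen in a minimal example:

```python
import itertools

data = [1, 1, 2, 2, 1, 1]
# The repeated key 1 produces two separate groups, because
# itertools.groupby only merges *adjacent* equal keys
groups = [(key, list(group)) for key, group in itertools.groupby(data)]
# groups == [(1, [1, 1]), (2, [2, 2]), (1, [1, 1])]
```
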
So you can try the following code:

import pandas as pd
import itertools

# Source DataFrame
df = pd.DataFrame(data=[
    [ 5153, '04/04/2017 11:40:00', True ], [ 7542, '04/04/2017 17:18:00', True ],
    [  875, '04/04/2017 20:08:00', True ], [   74, '05/04/2017 20:30:00', False ],
    [ 9652, '06/04/2017 20:32:00', True ], [  965, '07/04/2017 12:52:00', True ],
    [  752, '10/04/2017 17:40:00', True ], [ 9541, '10/04/2017 19:29:00', True ],
    [ 7452, '11/04/2017 12:20:00', True ], [ 9651, '12/04/2017 13:57:00', False ]],
    columns=[ 'TRN', 'TRN_DATE', 'FILTER_OK' ])
# Work list
xx = []
# Collect groups for 'True' key with at least 4 members
for key, group in itertools.groupby(df.values, lambda x: x[2]):
    lst = list(group)
    if key and len(lst) >= 4:
        xx.extend(lst)
# Create result DataFrame with the same column names
df2 = pd.DataFrame(data=xx, columns=df.columns)

Upvotes: 0

BENY

Reputation: 323226

Note that this will also match runs of at least 4 consecutive False values:

s = df.FILTER_OK.astype(int).diff().ne(0).cumsum()
df[s.isin(s.value_counts().loc[lambda x: x >= 4].index)]
Out[784]: 
    TRN             TRN_DATE  FILTER_OK
4  9652  06/04/2017 20:32:00       True
5   965  07/04/2017 12:52:00       True
6   752  10/04/2017 17:40:00       True
7  9541  10/04/2017 19:29:00       True
8  7452  11/04/2017 12:20:00       True

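To keep only the True runs, the same run labels can be combined with the FILTER_OK mask itself. A sketch, using just the FILTER_OK column from the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'FILTER_OK': [True, True, True, False, True,
                                 True, True, True, True, False]})

# Label each run of identical FILTER_OK values: 1,1,1,2,3,3,3,3,3,4
s = df.FILTER_OK.astype(int).diff().ne(0).cumsum()

# Restrict to runs of at least 4 rows that are also True
out = df[df.FILTER_OK & s.isin(s.value_counts().loc[lambda x: x >= 4].index)]
```

With the extra `df.FILTER_OK` term, a long run of False values can no longer slip through the size filter.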
Upvotes: 1

cs95

Reputation: 402333

You may do this using cumsum, followed by groupby, and transform:

v = (~df.FILTER_OK).cumsum()
df[v.groupby(v).transform('size').ge(4) & df['FILTER_OK']]

    TRN            TRN_DATE  FILTER_OK
4  9652 2017-06-04 20:32:00       True
5   965 2017-07-04 12:52:00       True
6   752 2017-10-04 17:40:00       True
7  9541 2017-10-04 19:29:00       True
8  7452 2017-11-04 12:20:00       True

Details
First, use cumsum to segregate rows into groups:

v = (~df.FILTER_OK).cumsum()
v

0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    1
8    1
9    2
Name: FILTER_OK, dtype: int64

Next, find the size of each group, and then figure out what groups have at least X rows (in your case, 4):

v.groupby(v).transform('size')

0    3
1    3
2    3
3    6
4    6
5    6
6    6
7    6
8    6
9    1
Name: FILTER_OK, dtype: int64

v.groupby(v).transform('size').ge(4)

0    False
1    False
2    False
3     True
4     True
5     True
6     True
7     True
8     True
9    False
Name: FILTER_OK, dtype: bool

AND this mask with "FILTER_OK" to ensure we only take valid rows that fit the criteria.

v.groupby(v).transform('size').ge(4) & df['FILTER_OK']

0    False
1    False
2    False
3    False
4     True
5     True
6     True
7     True
8     True
9    False
Name: FILTER_OK, dtype: bool
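For reuse, the idea can be wrapped in a small helper. One subtlety worth hedging on: the cumsum groups above include the False row that starts each group, so near the threshold the group size overcounts the True run by one (except for the very first group). A sketch based on labeling maximal runs directly avoids that edge case (the function name and `min_size` parameter are mine, not from the answer):

```python
import pandas as pd

def runs_of_true(df, col='FILTER_OK', min_size=4):
    """Return rows belonging to runs of True in `col` with at least min_size rows."""
    # Label maximal runs of equal values; each run is homogeneous,
    # so its size counts only rows of that run's own value
    run = df[col].ne(df[col].shift()).cumsum()
    big = df.groupby(run)[col].transform('size').ge(min_size)
    return df[big & df[col]]
```

On the question's sample data, `runs_of_true(df)` returns rows 4-8, matching the desired output.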

Upvotes: 1
