gies0r
gies0r

Reputation: 5239

Dataframe sequence detection: Find groups where three rows in a row have negative values

Lets say I have a column df['test']:

-1, -2, -3, 2, -4, 3, -5, -4, -3, -7

So I would like to filter out the groups which have at least three negative values in a row. So

groups = my_grouping_function_by_sequence()
groups[0] = [-1,-2-3]
groups[1] = [-5,-4,-3,-7] 

Are there some pre-defined checks on testing for sequences in numerical data for pandas? It does not need to be pandas, but I am searching for a fast and adaptable solution. Any advice would be helpful. Thanks!

Upvotes: 3

Views: 228

Answers (3)

Erfan
Erfan

Reputation: 42906

Using GroupBy and cumsum to create groups of consecutive negative numbers.

grps = df['test'].gt(0).cumsum()
dfs = [d.dropna() for _, d in df.mask(df['test'].gt(0)).groupby(grps) if d.shape[0] >= 3]

Output

for df in dfs:
    print(df)

   test
0  -1.0
1  -2.0
2  -3.0
   test
6  -5.0
7  -4.0
8  -3.0
9  -7.0

Explanation

Let's go through this step by step: The first line, creates groups for consecutive negative numbers

print(grps)
0    0
1    0
2    0
3    1
4    1
5    2
6    2
7    2
8    2
9    2
Name: test, dtype: int32

But as we can see, it also includes the positive numbers, which we don't want to consider in our ouput. So we use DataFrame.mask to convert these values to NaN:

df.mask(df['test'].gt(0))
# same as df.mask(df['test'] > 0)

   test
0  -1.0
1  -2.0
2  -3.0
3   NaN
4  -4.0
5   NaN
6  -5.0
7  -4.0
8  -3.0
9  -7.0

Then we groupby on this dataframe and only keep the groups which are >= 3 rows:

for _, d in df.mask(df['test'].gt(0)).groupby(grps):
    if d.shape[0] >= 3:
        print(d.dropna())

   test
0  -1.0
1  -2.0
2  -3.0
   test
6  -5.0
7  -4.0
8  -3.0
9  -7.0

Upvotes: 3

wwnde
wwnde

Reputation: 26676

Too acknowledge @erfan answer elegant but didn't easily understand. My attempt below.

df = pd.DataFrame({'test': [-1, -2, -3, 2, -4, 3, -5, -4, -3, -7]})

Conditionally select rows with negatives

df['j'] = np.where(df['test']<0,1,-1)
df['k']=df['j'].rolling(3, min_periods=1).sum()
df2=df[df['k']==3]

slice Iteratively the dataframe getting 3rd and 2 consecutive rows above

for index, row in df2.iterrows():
    print(df.loc[index - 2 : index + 0, 'test'])

Upvotes: 2

David Erickson
David Erickson

Reputation: 16683

@Erfan your answer is brilliant and I'm still trying to understand the second line. Your first line got me started to try to write it in my own, less efficient way.

import pandas as pd
df = pd.DataFrame({'test': [-1, -2, -3, 2, -4, 3, -5, -4, -3, -7]})
df['+ or -'] = df['test'].gt(0)
df['group'] = df['+ or -'].cumsum()

df_gb = df.groupby('group').count().reset_index().drop('+ or -', axis=1)

df_new = pd.merge(df, df_gb, how='left', on='group').drop('+ or -', axis=1)
df_new = df_new[(df_new['test_x'] < 0) & (df_new['test_y'] >=3)].drop('test_y', 
axis=1)

for i in df_new['group'].unique():
    j = pd.DataFrame(df_new.loc[df_new['group'] == i, 'test_x'])
    print(j)

Upvotes: 1

Related Questions