Reputation: 5239
Lets say I have a column df['test']
:
-1, -2, -3, 2, -4, 3, -5, -4, -3, -7
So I would like to filter out the groups which have at least three negative values in a row. So
groups = my_grouping_function_by_sequence()
groups[0] = [-1,-2-3]
groups[1] = [-5,-4,-3,-7]
Are there some pre-defined checks on testing for sequences in numerical data for pandas? It does not need to be pandas, but I am searching for a fast and adaptable solution. Any advice would be helpful. Thanks!
Upvotes: 3
Views: 228
Reputation: 42906
Using GroupBy
and cumsum
to create groups of consecutive negative numbers.
grps = df['test'].gt(0).cumsum()
dfs = [d.dropna() for _, d in df.mask(df['test'].gt(0)).groupby(grps) if d.shape[0] >= 3]
Output
for df in dfs:
print(df)
test
0 -1.0
1 -2.0
2 -3.0
test
6 -5.0
7 -4.0
8 -3.0
9 -7.0
Explanation
Let's go through this step by step: The first line, creates groups for consecutive negative numbers
print(grps)
0 0
1 0
2 0
3 1
4 1
5 2
6 2
7 2
8 2
9 2
Name: test, dtype: int32
But as we can see, it also includes the positive numbers, which we don't want to consider in our ouput. So we use DataFrame.mask
to convert these values to NaN
:
df.mask(df['test'].gt(0))
# same as df.mask(df['test'] > 0)
test
0 -1.0
1 -2.0
2 -3.0
3 NaN
4 -4.0
5 NaN
6 -5.0
7 -4.0
8 -3.0
9 -7.0
Then we groupby on this dataframe and only keep the groups which are >=
3 rows:
for _, d in df.mask(df['test'].gt(0)).groupby(grps):
if d.shape[0] >= 3:
print(d.dropna())
test
0 -1.0
1 -2.0
2 -3.0
test
6 -5.0
7 -4.0
8 -3.0
9 -7.0
Upvotes: 3
Reputation: 26676
Too acknowledge @erfan answer elegant but didn't easily understand. My attempt below.
df = pd.DataFrame({'test': [-1, -2, -3, 2, -4, 3, -5, -4, -3, -7]})
Conditionally select rows with negatives
df['j'] = np.where(df['test']<0,1,-1)
df['k']=df['j'].rolling(3, min_periods=1).sum()
df2=df[df['k']==3]
slice Iteratively the dataframe getting 3rd and 2 consecutive rows above
for index, row in df2.iterrows():
print(df.loc[index - 2 : index + 0, 'test'])
Upvotes: 2
Reputation: 16683
@Erfan your answer is brilliant and I'm still trying to understand the second line. Your first line got me started to try to write it in my own, less efficient way.
import pandas as pd
df = pd.DataFrame({'test': [-1, -2, -3, 2, -4, 3, -5, -4, -3, -7]})
df['+ or -'] = df['test'].gt(0)
df['group'] = df['+ or -'].cumsum()
df_gb = df.groupby('group').count().reset_index().drop('+ or -', axis=1)
df_new = pd.merge(df, df_gb, how='left', on='group').drop('+ or -', axis=1)
df_new = df_new[(df_new['test_x'] < 0) & (df_new['test_y'] >=3)].drop('test_y',
axis=1)
for i in df_new['group'].unique():
j = pd.DataFrame(df_new.loc[df_new['group'] == i, 'test_x'])
print(j)
Upvotes: 1