dh81

Reputation: 67

Find subsequent rows with values greater than a threshold and mark those rows

I have a data frame with measurements every 10 minutes (n = 85,000 values, so the solution should be efficient). I need to count how often a value is larger than 400, but only if this limit is exceeded for at least 30 minutes (i.e. we search for groups of 3 consecutive rows which meet the criterion lws > 400).

Example would be:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.date_range('1/1/2011', periods=20, freq='10Min'),
    'lws': np.random.randint(300, 500, size=(20,)),
})
df.set_index('datetime', inplace=True)

df['alarm'] = np.where(df['lws'] > 400, 1, 0)

gives

datetime,lws,alarm
2011-01-01 00:00:00,338,0
2011-01-01 00:10:00,435,**1
2011-01-01 00:20:00,415,**1
2011-01-01 00:30:00,321,0
2011-01-01 00:40:00,489,**1
2011-01-01 00:50:00,327,0
2011-01-01 01:00:00,481,**1
2011-01-01 01:10:00,316,0
2011-01-01 01:20:00,317,0
2011-01-01 01:30:00,455,1
2011-01-01 01:40:00,457,1
2011-01-01 01:50:00,442,1
2011-01-01 02:00:00,407,1
2011-01-01 02:10:00,407,**1
2011-01-01 02:20:00,416,**1
2011-01-01 02:30:00,331,0
2011-01-01 02:40:00,367,0
2011-01-01 02:50:00,360,0
2011-01-01 03:00:00,416,**1
2011-01-01 03:10:00,341,0

The result should be a new dataframe column containing a 1 if the criterion is met and a 0 if not; the wrong assumptions of my trial above are marked with ** (two asterisks).
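
In other words, a row should get a 1 exactly when it and the two following rows all exceed 400. As a rolling-window sketch of that intended logic (untested, only restating the example above as code):

m = df['lws'] > 400
# 1 where this row and the next two are all above the threshold;
# the trailing NaNs from shift(-2) compare unequal to 3 and become 0
df['flag'] = m.astype(int).rolling(3).sum().shift(-2).eq(3).astype(int)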

The fast answer here does not work correctly. I have now found a possibility which actually works, but it needs additional columns and might be slow for huge datasets:

m = df['lws'] > 400
df['m'] = np.where(df['lws'] > 400, 1, 0)
# new group id whenever m changes, i.e. one id per consecutive run
df['differ'] = df.m.diff().ne(0).cumsum()
# length of the run each row belongs to
df['gs'] = df.groupby('differ')['m'].transform(len)
# flag rows above 400 that sit in a run of at least 3
df['flag'] = np.where((df['gs'] > 2) & m, 1, 0)

Can someone find a solution which does not involve the additional columns?

I built on the idea from here.

Many thanks for help!

Upvotes: 0

Views: 82

Answers (2)

jezrael

Reputation: 863611

Use GroupBy.cumcount to build a counter over the consecutive True values generated by > 400, counting from the back, and use it to select the last 2 values of each run:

import numpy as np
import pandas as pd

np.random.seed(2021)
df = pd.DataFrame({
    'datetime': pd.date_range('1/1/2011', periods=20, freq='10Min'),
    'lws': np.random.randint(300, 500, size=(20,)),
})
df.set_index('datetime', inplace=True)

m = df['lws'] > 400
# each False increments the id, so consecutive Trues share one group
s = (~m).cumsum()

# countdown within each group selects the last 2 values of every run
mask = (m.groupby(s).cumcount(ascending=False) < 2) & m

df['alarm'] = np.select([mask, m], ['**1', 1], 0)
print(df)
                     lws alarm
datetime                      
2011-01-01 00:00:00  416   **1
2011-01-01 00:10:00  385     0
2011-01-01 00:20:00  357     0
2011-01-01 00:30:00  428   **1
2011-01-01 00:40:00  409   **1
2011-01-01 00:50:00  394     0
2011-01-01 01:00:00  344     0
2011-01-01 01:10:00  362     0
2011-01-01 01:20:00  457   **1
2011-01-01 01:30:00  321     0
2011-01-01 01:40:00  393     0
2011-01-01 01:50:00  452     1
2011-01-01 02:00:00  440     1
2011-01-01 02:10:00  498   **1
2011-01-01 02:20:00  402   **1
2011-01-01 02:30:00  370     0
2011-01-01 02:40:00  333     0
2011-01-01 02:50:00  401   **1
2011-01-01 03:00:00  307     0
2011-01-01 03:10:00  301     0

If you need to replace **1 with 0:

m = df['lws'] > 400
s = (~m).cumsum()

# keep only values followed by at least 2 more consecutive Trues
df['alarm'] = np.where((m.groupby(s).cumcount(ascending=False) > 1) & m, 1, 0)
print(df)
                     lws  alarm
datetime                       
2011-01-01 00:00:00  416      0
2011-01-01 00:10:00  385      0
2011-01-01 00:20:00  357      0
2011-01-01 00:30:00  428      0
2011-01-01 00:40:00  409      0
2011-01-01 00:50:00  394      0
2011-01-01 01:00:00  344      0
2011-01-01 01:10:00  362      0
2011-01-01 01:20:00  457      0
2011-01-01 01:30:00  321      0
2011-01-01 01:40:00  393      0
2011-01-01 01:50:00  452      1
2011-01-01 02:00:00  440      1
2011-01-01 02:10:00  498      0
2011-01-01 02:20:00  402      0
2011-01-01 02:30:00  370      0
2011-01-01 02:40:00  333      0
2011-01-01 02:50:00  401      0
2011-01-01 03:00:00  307      0
2011-01-01 03:10:00  301      0
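
Here cumcount(ascending=False) > 1 is the complement of < 2 above: it keeps exactly the values that are followed by at least two more consecutive Trues, i.e. the rows that start a full 30-minute window.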

EDIT:

If you need a solution without helper columns, use size instead of len for better performance:

m = df['lws'] > 400

# run length per row, computed without intermediate columns
gs = m.groupby(m.diff().ne(0).cumsum()).transform('size')
df['flag'] = np.where((gs > 2) & m, 1, 0)
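
transform('size') runs as a single cythonized groupby operation, while transform(len) calls the Python len function once per group, which should add up over 85,000 rows split into many short runs.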

Upvotes: 2

Jules Civel

Reputation: 641

Here is a solution using overlapping chunks (which can be very useful in your case):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'datetime': pd.date_range('1/1/2011', periods=20, freq='10Min'),
    'lws': np.random.randint(300, 500, size=(20,)),
})
df.set_index('datetime', inplace=True)

def chunker(seq, size, overlap):
    # yield windows of `size` rows, stepping `size - overlap` rows at a time
    for pos in range(0, len(seq), size - overlap):
        yield seq.iloc[pos:pos + size]

chunk_size = 3
chunk_overlap = 2

alarm = []
for chunk in chunker(df, chunk_size, chunk_overlap):
    # check the length to avoid errors on the truncated chunks at the end;
    # use .iloc for positional access, since the index is a DatetimeIndex
    if (len(chunk) == 3 and chunk['lws'].iloc[0] > 400
            and chunk['lws'].iloc[1] > 400 and chunk['lws'].iloc[2] > 400):
        alarm.append(1)
    else:
        alarm.append(0)
df['alarm'] = alarm

Output:

2011-01-01 00:00:00 342 0
2011-01-01 00:10:00 331 0
2011-01-01 00:20:00 326 0
2011-01-01 00:30:00 480 0
2011-01-01 00:40:00 328 0
2011-01-01 00:50:00 317 0
2011-01-01 01:00:00 495 0
2011-01-01 01:10:00 445 0
2011-01-01 01:20:00 320 0
2011-01-01 01:30:00 390 0
2011-01-01 01:40:00 360 0
2011-01-01 01:50:00 459 0
2011-01-01 02:00:00 358 0
2011-01-01 02:10:00 339 0
2011-01-01 02:20:00 423 1
2011-01-01 02:30:00 413 0
2011-01-01 02:40:00 442 0
2011-01-01 02:50:00 378 0
2011-01-01 03:00:00 410 0
2011-01-01 03:10:00 375 0
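
Note that with chunk_size=3 and chunk_overlap=2 the generator steps one row at a time, so the loop body runs once per row in pure Python; for the 85,000-row frame from the question this will be considerably slower than the vectorized groupby approach above, but the window logic stays easy to read and adapt.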

Upvotes: 0
