Reputation: 67
I have a data frame with measurements every 10 minutes (n=85000 values, so the solution should be as efficient as possible). I need to count how often a value is larger than 400, but only if this limit is exceeded for at least 30 minutes (i.e. we search for groups of at least 3 consecutive rows that meet the criterion lws > 400).
An example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.date_range('1/1/2011', periods=20, freq='10Min'),
    'lws': np.random.randint(300, 500, size=(20,)),
})
df.set_index('datetime', inplace=True)
df['alarm'] = np.where(df['lws'] > 400, 1, 0)
gives:
datetime,lws,alarm
2011-01-01 00:00:00,338,0
2011-01-01 00:10:00,435,**1
2011-01-01 00:20:00,415,**1
2011-01-01 00:30:00,321,0
2011-01-01 00:40:00,489,**1
2011-01-01 00:50:00,327,0
2011-01-01 01:00:00,481,**1
2011-01-01 01:10:00,316,0
2011-01-01 01:20:00,317,0
2011-01-01 01:30:00,455,1
2011-01-01 01:40:00,457,1
2011-01-01 01:50:00,442,1
2011-01-01 02:00:00,407,1
2011-01-01 02:10:00,407,**1
2011-01-01 02:20:00,416,**1
2011-01-01 02:30:00,331,0
2011-01-01 02:40:00,367,0
2011-01-01 02:50:00,360,0
2011-01-01 03:00:00,416,**1
2011-01-01 03:10:00,341,0
The result should be a new data frame column containing a 1 if the criterion is met and a 0 if not; the rows my trial flags wrongly are marked with ** (two asterisks).
The fast answer here is not working correctly. I have now established a possibility that actually works, but it needs additional columns and might be slow for huge datasets:
m = df['lws'] > 400
df['m'] = np.where(df['lws'] > 400, 1, 0)              # 1 where the limit is exceeded
df['differ'] = df.m.diff().ne(0).cumsum()              # id of each consecutive run
df['gs'] = df.groupby('differ')['m'].transform(len)    # length of that run
df['flag'] = np.where((df['gs'] > 2) & m, 1, 0)        # runs of 3+ rows above the limit
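The actual count I am after can then be read off the flag column, e.g. (a sketch; each rising 0-to-1 edge of flag counts as one alarm event):
n_rows = df['flag'].sum()                                    # rows inside alarm periods
n_events = df['flag'].diff().fillna(df['flag']).eq(1).sum()  # distinct alarm periods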
Can someone find a solution that does not involve the additional columns? (The grouping idea is from here.)
Many thanks for any help!
Upvotes: 0
Views: 82
Reputation: 863611
Use GroupBy.cumcount for a counter of the consecutive True values generated by > 400, counting from the back so the last 2 values of each run can be selected:
import numpy as np
import pandas as pd

np.random.seed(2021)
df = pd.DataFrame({
    'datetime': pd.date_range('1/1/2011', periods=20, freq='10Min'),
    'lws': np.random.randint(300, 500, size=(20,)),
})
df.set_index('datetime', inplace=True)

m = df['lws'] > 400
s = (~m).cumsum()                                        # run id: constant within each True-run
mask = (m.groupby(s).cumcount(ascending=False) < 2) & m  # last 2 True values of each run
df['alarm'] = np.select([mask, m], ['**1', 1], 0)
print(df)
lws alarm
datetime
2011-01-01 00:00:00 416 **1
2011-01-01 00:10:00 385 0
2011-01-01 00:20:00 357 0
2011-01-01 00:30:00 428 **1
2011-01-01 00:40:00 409 **1
2011-01-01 00:50:00 394 0
2011-01-01 01:00:00 344 0
2011-01-01 01:10:00 362 0
2011-01-01 01:20:00 457 **1
2011-01-01 01:30:00 321 0
2011-01-01 01:40:00 393 0
2011-01-01 01:50:00 452 1
2011-01-01 02:00:00 440 1
2011-01-01 02:10:00 498 **1
2011-01-01 02:20:00 402 **1
2011-01-01 02:30:00 370 0
2011-01-01 02:40:00 333 0
2011-01-01 02:50:00 401 **1
2011-01-01 03:00:00 307 0
2011-01-01 03:10:00 301 0
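To see what the helper series do, here is a small inspection sketch (not part of the answer's output): s stays constant within each run of True values, and the descending cumcount is 0 at the last row of each run:
debug = pd.DataFrame({
    'lws': df['lws'],
    'm': m,                                          # True where lws > 400
    's': s,                                          # run id from (~m).cumsum()
    'back': m.groupby(s).cumcount(ascending=False),  # 0 at the last row of a run
})
print(debug.head(6))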
If you need to replace **1 with 0:
m = df['lws'] > 400
s = (~m).cumsum()
# flag only True values with at least 2 more True values after them in the run
df['alarm'] = np.where((m.groupby(s).cumcount(ascending=False) > 1) & m, 1, 0)
print(df)
lws alarm
datetime
2011-01-01 00:00:00 416 0
2011-01-01 00:10:00 385 0
2011-01-01 00:20:00 357 0
2011-01-01 00:30:00 428 0
2011-01-01 00:40:00 409 0
2011-01-01 00:50:00 394 0
2011-01-01 01:00:00 344 0
2011-01-01 01:10:00 362 0
2011-01-01 01:20:00 457 0
2011-01-01 01:30:00 321 0
2011-01-01 01:40:00 393 0
2011-01-01 01:50:00 452 1
2011-01-01 02:00:00 440 1
2011-01-01 02:10:00 498 0
2011-01-01 02:20:00 402 0
2011-01-01 02:30:00 370 0
2011-01-01 02:40:00 333 0
2011-01-01 02:50:00 401 0
2011-01-01 03:00:00 307 0
2011-01-01 03:10:00 301 0
EDIT: If you need a solution without the helper columns, use size instead of len for better performance:
m = df['lws'] > 400
gs = m.groupby(m.diff().ne(0).cumsum()).transform('size')  # length of each consecutive run
df['flag'] = np.where((gs > 2) & m, 1, 0)                  # 3+ consecutive values above 400
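The size-vs-len difference is easy to check on a frame of the size mentioned in the question; a rough timing sketch (the data here is made up, and absolute numbers will vary):
from timeit import timeit

big = pd.DataFrame({'lws': np.random.randint(300, 500, size=85_000)})
mb = big['lws'] > 400
grp = mb.diff().ne(0).cumsum()
print(timeit(lambda: mb.groupby(grp).transform('size'), number=10))  # vectorized fast path
print(timeit(lambda: mb.groupby(grp).transform(len), number=10))     # Python-level, slower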
Upvotes: 2
Reputation: 641
Here is a solution I made using chunks (which can be very useful in your case):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.date_range('1/1/2011', periods=20, freq='10Min'),
    'lws': np.random.randint(300, 500, size=(20,)),
})
df.set_index('datetime', inplace=True)

def chunker(seq, size, overlap):
    # yield overlapping windows of `size` rows, stepping by size - overlap
    for pos in range(0, len(seq), size - overlap):
        yield seq.iloc[pos:pos + size]

chunk_size = 3
chunk_overlap = 2

alarm = []
for chunk in chunker(df, chunk_size, chunk_overlap):
    lws = chunk['lws']
    # check length == 3 to avoid errors on the last, shorter chunks
    if len(lws) == 3 and lws.iloc[0] > 400 and lws.iloc[1] > 400 and lws.iloc[2] > 400:
        alarm.append(1)
    else:
        alarm.append(0)
df['alarm'] = alarm
Output:
2011-01-01 00:00:00 342 0
2011-01-01 00:10:00 331 0
2011-01-01 00:20:00 326 0
2011-01-01 00:30:00 480 0
2011-01-01 00:40:00 328 0
2011-01-01 00:50:00 317 0
2011-01-01 01:00:00 495 0
2011-01-01 01:10:00 445 0
2011-01-01 01:20:00 320 0
2011-01-01 01:30:00 390 0
2011-01-01 01:40:00 360 0
2011-01-01 01:50:00 459 0
2011-01-01 02:00:00 358 0
2011-01-01 02:10:00 339 0
2011-01-01 02:20:00 423 1
2011-01-01 02:30:00 413 0
2011-01-01 02:40:00 442 0
2011-01-01 02:50:00 378 0
2011-01-01 03:00:00 410 0
2011-01-01 03:10:00 375 0
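The same forward-looking check can also be written without an explicit loop; a sketch using pandas' FixedForwardWindowIndexer (it flags the first row of each window of 3 values that are all above 400, matching the chunker's behaviour):
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
df['alarm2'] = (df['lws'].rolling(indexer, min_periods=3).min() > 400).astype(int)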
Upvotes: 0