tomferrari
tomferrari

Reputation: 53

Python - Count row between interval in dataframe

I have a dataset with a date, engine, energy and max power column. Let's say that the dataset is composed of 2 machines and a depth of one month. Each machine has a maximum power (say 100 for simplicity). Each machine with 3 operating states (between Pmax and 80% of Pmax either nominal power, between 80% and 20% of Pmax or drop in load and finally below 20% of Pmax at 0 we consider that the machine stops below 20%)

The idea is to know, by period and machine, the number of times the machine has operated in the 2nd interval (between 80% and 20% of the Pmax). If a machine drops to stop it should not be counted and if it returns from stop it should not be counted either.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from numpy.ma.extras import _ezclump as ez


data = {'date': ['01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020',
                 '01/01/2020', '01/02/2020', '01/03/2020', '01/04/2020', '01/05/2020', '01/06/2020', '01/07/2020', '01/08/2020', '01/09/2020', '01/10/2020', '01/11/2020', '01/12/2020', '01/13/2020', '01/14/2020', '01/15/2020', '01/16/2020', '01/17/2020', '01/18/2020', '01/19/2020', '01/20/2020', '01/21/2020', '01/22/2020', '01/23/2020', '01/24/2020', '01/25/2020', '01/26/2020', '01/27/2020', '01/28/2020', '01/29/2020', '01/30/2020', '01/31/2020'],
                'engine': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
                           'b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b','b',],
                'energy': [100,100,100,100,100,80,80,60,60,60,60,60,90,100,100,50,50,40,20,0,0,0,20,50,60,100,100,50,50,50,50,
                            50,50,100,100,100,80,80,60,60,60,60,60,0,0,0,50,50,100,90,50,50,50,50,50,60,100,100,50,50,100,100],
                'pmax': [100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,
                         100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100]
                }
df = pd.DataFrame(data, columns = ['date', 'engine', 'energy', 'pmax'])

df['date'] = df['date'].astype('datetime64[ns]')

df = df.set_index('date')


df['inter'] = df['energy'].apply(lambda x: 2 if x >= 80 else (1 if x < 80 and x >= 20 else 0 ))

liste = []
engine_off = ez((df['inter'] == 1).to_numpy())
for i in engine_off:
    if df.iloc[(i.start)-1, 3] == 0:
        engine_off.remove(i)
    elif df.iloc[(i.stop), 3] == 0:
        engine_off.remove(i)
    else:
        liste.append([df['engine'][i.start],df.index[i.start],df.index[i.stop], i.stop - i.start])

dfend = pd.DataFrame(liste, columns=['engine','begin','end','nb_heure'])

dfend['month'] = dfend['begin'].dt.month_name()

dfgroupe = dfend.set_index('begin').groupby(['engine','month']).agg(['mean','max','min','std','count','sum']).fillna(1)

Either I recover my data in a Dataframe, I classify for each line the associated energy in an interval (2 for nominal operation, 1 for intermediate and 0 for stop)

Then I check if each row in the interval == 1 column allowing me to retrieve a list of slices with the start and end of each slice.

Then I loop to check that each element before or after my slice is different from 0 to exclude the decreases for stop or return from stop.

Then I create a dataframe from the list, then I average, sum, etc.

The problem is that my list has only 4 drops while there are 5 drops. This comes from the 4 slice (27.33)

Can someone help me?

Thank you

Upvotes: 1

Views: 257

Answers (2)

Ben.T
Ben.T

Reputation: 29635

here is one way to do it, I tried to use your way with groups but ended up to do it slightly differently

# another way to create inter, probably faster on big dataframe
df['inter'] = pd.cut(df['energy']/df['pmax'], [-1,0.2, 0.8, 1.01], 
                     labels=[0,1,2], right=False)

# mask if inter is equal to 1 and groupby engine
gr = df['inter'].mask(df['inter'].eq(1)).groupby(df['engine'])

# create a mask to get True for the rows you want
m = (df['inter'].eq(1)   # the row are 1s
     & ~gr.ffill().eq(0) # the row before 1s is not 0
     & ~gr.bfill().eq(0) # the row after 1s is not 0
    )

#create dfend with similar shape to yours
dfend = (df.assign(date=df.index) #create a column date for the agg
           .where(m) # replace the rows not interesting by nan
           .groupby(['engine', #groupby per engine
                     m.ne(m.shift()).cumsum()]) # and per group of following 1s
           .agg(begin=('date','first'), #agg date with both start date
                end = ('date','last'))  # and end date
         )
# create the colum nb_hours (although here it seems to be nb_days)
dfend['nb_hours'] = (dfend['end'] - dfend['begin']).dt.days+1

print (dfend)
                  begin        end  nb_hours
engine inter                                
a      2     2020-01-08 2020-01-12         5
       4     2020-01-28 2020-01-31         4
b      4     2020-01-01 2020-01-02         2
       6     2020-01-20 2020-01-25         6
       8     2020-01-28 2020-01-29         2

and you got the three segment for engine b as required, then you can

#create dfgroupe
dfgroupe = (dfend.groupby(['engine', #groupby engine
                           dfend['begin'].dt.month_name()]) #and month name
                 .agg(['mean','max','min','std','count','sum']) #agg
                 .fillna(1)
           )
print (dfgroupe)
                nb_hours                            
                    mean max min       std count sum
engine begin                                        
a      January  4.500000   5   4  0.707107     2   9
b      January  3.333333   6   2  2.309401     3  10

Upvotes: 1

Essi
Essi

Reputation: 16

I am assuming the following terminology:
- 80 <= energy <= 100 ---> df['inter'] == 2, normal mode.
- 20 <= energy < 80 ---> df['inter'] == 1, intermediate mode.
- 20 > energy ---> df['inter'] == 0, stop mode.

I reckon you want to find those periods of time in which:
1) The machine is operating in intermediate mode.
2) You don't want to count if the status is changing from intermediate to stop mode or from stop to intermediate mode.

# df['before']: this is to compare each row of df['inter'] with the previous row
# df['after']: this is to compare each row of df['inter'] with the next row
# df['target'] == 1 is when both above mentioned conditions (conditions 1 and 2) are met.
# In the next we mask the original df and keep those times that conditions 1 and 2 are met, then we group by machine and month, and after that obtain the min, max, mean, and so on.

df['before'] = df['inter'].shift(periods=1, fill_value=0)
df['after'] = df['inter'].shift(periods=-1, fill_value=0)
df['target'] = np.where((df['inter'] == 1) & (np.sum(df[['inter', 'before', 'after']], axis=1) > 2), 1, 0)
df['month'] = df['date'].dt.month
mask = df['target'] == 1
df_group = df[mask].groupby(['engine', 'month']).agg(['mean', 'max', 'min', 'std', 'count', 'sum'])

Upvotes: 0

Related Questions