SMJune
SMJune

Reputation: 407

Flag if row meets criteria within n days

In a previous question, @KartikeySingh, got me very close. But I need to refine it further by flagging only the positive inflow when at least 90% of it flows out (negative flow) within a 5 day period. So in the example below, index 4 and 5 should not get flagged, but index items 7, 10, 17 and 19 should get flagged because the inflows and outflows meet those parameters. So how would I flag only the inflow and outflows that are within 90% (negative value) of each other and the outflow occurs within 5 days of the inflow.

stream = [2, 0, 1, 0, 3, 2, 100, 0, 0, -95, 3, 0, 2, -1, 0, 2, 93, -2, -89]
date = [
'2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05',
'2019-01-06', '2019-01-07', '2019-01-08', '2019-01-09', '2019-01-10',
'2019-01-11', '2019-01-13', '2019-01-14', '2019-01-15', '2019-01-16',
'2019-01-17', '2019-01-18', '2019-01-19', '2019-01-20'
]

df = pd.DataFrame({'date': date, 'stream': stream})

def process(row):
if row['stream'] > 20*row['stream_mean']:
    return 1
else:
    return 0
df['stream_mean'] = df['stream'].rolling(5).mean()
df['stream_mean'] = df['stream_mean'].shift(periods=1)
df['flag'] = df.apply(process,axis=1)
df

The code above flags all incoming flows regardless of the outflow criteria.

Upvotes: 1

Views: 272

Answers (1)

Inder
Inder

Reputation: 3816

The solution to this problem is easier with .loc, you can use the following code,comments in the code are explaining the logic:

This part is just copy of your code from the question:

import pandas as pd
import numpy as np
stream = [2, 0, 1, 0, 3, 2, 100, 0, 0, -95, 3, 0, 2, -1, 0, 2, 93, -2, -89]
date = [
'2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04', '2019-01-05',
'2019-01-06', '2019-01-07', '2019-01-08', '2019-01-09', '2019-01-10',
'2019-01-11', '2019-01-13', '2019-01-14', '2019-01-15', '2019-01-16',
'2019-01-17', '2019-01-18', '2019-01-19', '2019-01-20'
]
df = pd.DataFrame({'date': date, 'stream': stream})

This is the code that solves the problem:

p_list=[n for n in df.stream if n >0] # we are getting positive values from stream column
p_mean=sum(p_list)/len(p_list) # finding mean (as unusual value threshold) 
n_list=[n for n in df.stream if n <0] #similarly getting negative values
n_mean=sum(n_list)/len(n_list) #finding threshold on the negative side.

After getting the threshold values(do see that you can manually set these values if you like this is just an attempt at automating the entire process).

p_flags=df.index[(df.stream > p_mean) &((df.stream.shift(-1) <=-.9*df.stream ) | (df.stream.shift(-2)<=-0.9*df.stream
                                                                      )|(df.stream.shift(-3)<=-0.9*df.stream
                                                                        )|(df.stream.shift(-4)<=-0.9*df.stream
                                                                          )|(df.stream.shift(-5)<=-0.9*df.stream
                                                                            ))]

what this will do is return the index of the rows with matching criteria, the logic of the criteria is very simply that we are checking if a row is greater than the p_mean value, if it is then we check if any of the next five rows have value that is less than 90% of such a value (that is outflow) | operator means or so any outflow in the next 5 will mean the part returns true.

in order to find the negative flags index we have a similar thing only in revers:

n_flags=df.index[(df.stream < n_mean) &((df.stream.shift(1) >=0.9*df.stream ) | (df.stream.shift(2)>=0.9*df.stream
                                                                  )|(df.stream.shift(3)>=0.9*df.stream
                                                                    )|(df.stream.shift(4)>=0.9*df.stream
                                                                      )|(df.stream.shift(5)>=0.9*df.stream
                                                                        ))]

Now you have the indices if the rows that matches the criteria in order to add this as a column in the dataframe simply do a:

flags=np.zeros(len(df))
for i1,i2 in zip(n_flags,p_flags):
    flags[i1]=1
    flags[i2]=1

df["flags"]=flags  
print(df)

The output will be:

    date    stream  flags
0   2019-01-01  2   0.0
1   2019-01-02  0   0.0
2   2019-01-03  1   0.0
3   2019-01-04  0   0.0
4   2019-01-05  3   0.0
5   2019-01-06  2   0.0
6   2019-01-07  100 1.0
7   2019-01-08  0   0.0
8   2019-01-09  0   0.0
9   2019-01-10  -95 1.0
10  2019-01-11  3   0.0
11  2019-01-13  0   0.0
12  2019-01-14  2   0.0
13  2019-01-15  -1  0.0
14  2019-01-16  0   0.0
15  2019-01-17  2   0.0
16  2019-01-18  93  1.0
17  2019-01-19  -2  0.0
18  2019-01-20  -89 1.0

Upvotes: 1

Related Questions