Reputation: 1207
I have a dataframe as follows. I want to replace the row values based on the following condition.
if 3 consecutive previous row values are 0, then keep the values as it is or if only one previous row value is 0, then fill that row by rolling mean for last 3 rows for that particular IEMI.
First the dataframe (df)
.
IMEI KVA KwH Expected KVA
55647 1307.65 1020.33 1307.65
55468 2988.00 1109.05 2988.00
55647 0.00 977.87 0.00
55467 0.00 1388.25 0.00
55647 0.00 445.37 0.00
55469 1888.97 933.48 1888.97
55647 1338.65 1120.33 1338.65
55468 2088.00 1019.05 2088.00
55647 0.00 977.87 =(1307.65+0.00+1338.65)/3=882.1
55469 1455.28 1388.25 1455.28
55648 2144.38 445.37 2144.38
55469 1888.97 933.48 1888.97
Please observe the calculation of Expected KVA
column. Since the actual KVA
value at 9th row is 0, the value of Expected KVA
is simple average of last three observation (KVA
) for IEMI 55647. But in 3rd, 4th and 5th row since 3 KVA
values for IEMI 55647 is 0, therefore the Expected KVA
value is also 0.
Edit
Edit - Further
Below is a pseudo code snippet. I have reset_index
. Still It may not work
df = df.sort_values(['IMEI'],ascending=0).reset_index(drop=True)
for i in range(0,len(df)):
if df.loc[i:i+2,'KVA']==0:
df['Expected KVA'] = df['KVA']
else:
df['Expected KVA']=df.groupby('IEMI')['KVA'].rolling(min_periods=1,center=True,window=3).mean()
I believe that the above code snippet may fail as I have sorted IEMI
thus index got distorted.
How can it be achieved? A for
loop coupled with groupby
function? Any thought on this?
Upvotes: 0
Views: 925
Reputation: 75080
Below is what i came up with:(I have added 3 extra rows with IMEI : 55674 just for testing)
Removing consecutive 0s with a group of 3 (which needs no action) and slicing on the dataframe:
import itertools
def consecutive(data, stepsize=1):
return np.split(data, np.where(np.diff(data) != stepsize)[0]+1)
a = np.array(df[df.KVA == 0.00].index)
l = consecutive(a)
to_exclude=list(itertools.chain.from_iterable([i.tolist() for i in l if len(i)==3]))
pd.options.mode.chained_assignment = None
df1 = df.loc[~df.index.isin(to_exclude)]
>>df1
IMEI KVA KwH
0 55647 1307.65 1020.33
1 55468 2988.00 1109.05
5 55469 1888.97 933.48
6 55647 1338.65 1120.33
7 55468 2088.00 1019.05
8 55647 0.00 977.87
9 55469 1455.28 1388.25
10 55648 2144.38 445.37
11 55469 1888.97 933.48
12 55674 0.00 6433.00
13 55674 1345.00 6542.00
14 55674 3456.00 6541.00
Assigning the leftover 0s with np.nan
and doing a groupby
with transform
and fillna
with the mean
df1['KVA'] = df1['KVA'].replace(0, np.nan)
df1['KVA'] = df1['KVA'].fillna(df1.fillna(0).groupby(['IMEI'])['KVA'].transform('mean'))
>>df1
IMEI KVA KwH
0 55647 1307.650000 1020.33
1 55468 2988.000000 1109.05
5 55469 1888.970000 933.48
6 55647 1338.650000 1120.33
7 55468 2088.000000 1019.05
8 55647 882.100000 977.87
9 55469 1455.280000 1388.25
10 55648 2144.380000 445.37
11 55469 1888.970000 933.48
12 55674 1600.333333 6433.00
13 55674 1345.000000 6542.00
14 55674 3456.000000 6541.00
Then just concat
and sort_index
those which we had left out earlier:
pd.concat([df1,df.loc[df.index.isin(to_exclude)]]).sort_index()
IMEI KVA KwH
0 55647 1307.650000 1020.33
1 55468 2988.000000 1109.05
2 55647 0.000000 977.87
3 55467 0.000000 1388.25
4 55647 0.000000 445.37
5 55469 1888.970000 933.48
6 55647 1338.650000 1120.33
7 55468 2088.000000 1019.05
8 55647 882.100000 977.87
9 55469 1455.280000 1388.25
10 55648 2144.380000 445.37
11 55469 1888.970000 933.48
12 55674 1600.333333 6433.00
13 55674 1345.000000 6542.00
14 55674 3456.000000 6541.00
Upvotes: 2