Reputation: 5940
I have a dataframe df
that is defined like so:
import numpy as np
import pandas as pd
dic = {'A':['1A','1A','3C','3C','3C','7M','7M','7M'],'B':[10,15,49,75,35,33,45,65],'C':[11,56,32,78,45,89,15,14],'D':[111,0,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],'E':[0,222,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
df = pd.DataFrame(dic)
My goal is to perform some calculations between rows having the same item in column A
.
The function is defined like this (but could be anything):
def fun(a,b,c,d):
out = a*c + b/2 + d*b
return out
The result of such operation will be stored in column D and E according to the following rules:
# Fill column D
for j in range(0,len(df)-1):
if df['A'].iloc[j]==df['A'].iloc[j+1] and pd.isnull(df['D'].iloc[j]):
df['D'].iloc[j] = fun(df['B'].iloc[j],df['B'].iloc[j],df['B'].iloc[j+1],df['B'].iloc[j+1])
# Fill column E
for j in reversed(range(1,len(df))):
if df['A'].iloc[j-1]==df['A'].iloc[j] and pd.isnull(df['E'].iloc[j]):
df['E'].iloc[j] = fun(df['B'].iloc[j],df['B'].iloc[j],df['B'].iloc[j-1],df['B'].iloc[j-1])
Both loops are very similar but the second one is looping from the last element to the first one of the dataframe. My code works correctly and the result should be like this:
# Before # # After #
A B C D E A B C D E
0 1A 10 11 111 0 0 1A 10 11 111.0 0.0
1 1A 15 56 0 222 1 1A 15 56 0.0 222.0
2 3C 49 32 NaN NaN 2 3C 49 32 7374.5 NaN
3 3C 75 78 NaN NaN 3 3C 75 78 5287.5 7387.5
4 3C 35 45 NaN NaN 4 3C 35 45 NaN 5267.5
5 7M 33 89 NaN NaN 5 7M 33 89 2986.5 NaN
6 7M 45 15 NaN NaN 6 7M 45 15 5872.5 2992.5
7 7M 65 14 NaN NaN 7 7M 65 14 NaN 5882.5
Would you be able to improve such code in order to make it more efficient using some build in functions from the Pandas library? I would like to have some more elegant way to achieve my result.
Note: the first and second row have already a value (111 0
and 0 222
) and therefore they must not be calculated by the function!
Upvotes: 4
Views: 2622
Reputation: 5940
In order to solve my problem I defined another function which takes as input fun
def fun2(df,s):
X= fun(df.B,df.C,df.B.shift(s),df.C.shift(s))
return X
Columns D
and E
can be filled like so:
df2['D']=np.where((df2.A.shift(-1)==df2.A) & (df2.D.isnull()==True),fun2(df2,-1),df2.D)
df2['E']=np.where((df2.A.shift(1)==df2.A) & (df2.E.shift(1).isnull()==True),fun2(df2,+1),df2.E)
Note: even though more compact, this approach may be slower
Upvotes: 0
Reputation: 1790
You could first group by the values of A, and then apply a vectorized function :
def fun(a,b,c,d):
out = a*c + b/2 + d*b
return out
def apply_func(df):
mask = pd.isnull(df['D'][:-1])
df['D'][:-1][mask] = fun(df['B'][:-1].values, df['B'][:-1].values,
df['B'][1:].values, df['B'][1:].values)
mask = pd.isnull(df['E'][1:])
df['E'][1:][mask] = fun(df['B'][1:].values, df['B'][1:].values,
df['B'][:-1].values, df['B'][:-1].values)
return df
and then :
df = df.groupby('A').apply(apply_func).reset_index(drop=True)
A B C D E
0 1A 10 11 305.0 NaN
1 1A 15 56 NaN 307.5
2 3C 49 32 7374.5 NaN
3 3C 75 78 5287.5 7387.5
4 3C 35 45 NaN 5267.5
5 7M 33 89 2986.5 NaN
6 7M 45 15 5872.5 2992.5
7 7M 65 14 NaN 5882.5
Upvotes: 2
Reputation: 9711
you could use np.where and dataframe.shift()
datafrmae.shift() - Shift index by desired number of periods with an optional time freq
df['D']=np.where(df.A.shift(-1)==df.A,func(df['B'],df['B'],df.B.shift(-1),df.B.shift(-1)),np.NaN)
Upvotes: 1