Reputation: 81
I recently learned about pandas
and was happy to see its analytics functionality. I am trying to convert Excel array functions into the Pandas equivalent to automate spreadsheets that I have created for the creation of performance attribution reports. In this example, I created a new column in Excel based on conditions within other columns:
={SUMIFS($F$10:$F$4518,$A$10:$A$4518,$C$4,$B$10:$B$4518,0,$C$10:$C$4518," ",$D$10:$D$4518,$D10,$E$10:$E$4518,$E10)}
The formula is summing up the values in the "F" array (security weights) based on certain conditions. "A" array (portfolio ID) is a certain number, "B" array (security id) is zero, "C" array (group description) is " ", "D" array (start date) is the date of the row that I am on, and "E" array (end date) is the date of the row that I am on.
In Pandas, I am using the DataFrame. Creating a new column on a dataframe with the first three conditions is straight forward, but I am having difficult with the last two conditions.
reportAggregateDF['PORT_WEIGHT'] = reportAggregateDF['SEC_WEIGHT_RATE']
[(reportAggregateDF['PORT_ID'] == portID) &
(reportAggregateDF['SEC_ID'] == 0) &
(reportAggregateDF['GROUP_LIST'] == " ") &
(reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].ix[:]) &
(reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].ix[:])].sum()
Obviously the .ix[:] in the last two conditions is not doing anything for me, but is there a way to make the sum conditional on the row that I am on without looping? My goal is to not do any loops, but instead use purely vector operations.
Upvotes: 6
Views: 10279
Reputation: 927
You want to use the apply function and a lambda:
>> df
A B C D E
0 mitfx 0 200 300 0.25
1 gs 1 150 320 0.35
2 duk 1 5 2 0.45
3 bmo 1 145 65 0.65
Let's say I want to sum column C times E but only if column B == 1 and D is greater than 5:
df['matches'] = df.apply(lambda x: x['C'] * x['E'] if x['B'] == 1 and x['D'] > 5 else 0, axis=1)
df.matches.sum()
It might be cleaner to split this into two steps:
df_subset = df[(df.B == 1) & (df.D > 5)]
df_subset.apply(lambda x: x.C * x.E, axis=1).sum()
or to use simply multiplication for speed:
df_subset = df[(df.B == 1) & (df.D > 5)]
print sum(df_subset.C * df_subset.E)
You are absolutely right to want to do this problem without loops.
Upvotes: 9
Reputation: 81
I'm sure there is a better way, but this did it in a loop:
for idx, eachRecord in reportAggregateDF.T.iteritems():
reportAggregateDF['PORT_WEIGHT'].ix[idx] = reportAggregateDF['SEC_WEIGHT_RATE'][(reportAggregateDF['PORT_ID'] == portID) &
(reportAggregateDF['SEC_ID'] == 0) &
(reportAggregateDF['GROUP_LIST'] == " ") &
(reportAggregateDF['START_DATE'] == reportAggregateDF['START_DATE'].ix[idx]) &
(reportAggregateDF['END_DATE'] == reportAggregateDF['END_DATE'].ix[idx])].sum()
Upvotes: 1