Reputation: 442
I have a huge data set to process and I am trying to optimize the most costly line, processing-wise.
I use a df with 3 columns, A, B and C. I have 2 values, a and b, which are used to update the value of C in a subset of the df.
Before I continue, let me define a textual substitution to increase readability:
filter(_X) -> df.loc[df['A'] < a, _X]
Every time I type "filter", please substitute it with the text on the right (applying the correct argument in place of the parameter _X - think C/C++ macros). The line of code in question is:
filter('C') += a * np.minimum(filter('B'), b)
What I'm not sure about is whether Python will evaluate "filter" twice when evaluating the expression, or whether it will use a "reference" (à la C++) and only do it once. In the former case, is there a way for me to rewrite the expression so that the code of "filter" is not executed twice?
Moreover, if you have suggestions on how to rewrite the "filter" itself, I'd be happy to test them.
EDIT: Expanded version of the code:
df.loc[df['A'] < a, 'C'] += a * np.minimum(df.loc[df['A'] < a, 'B'], b)
Upvotes: 1
Views: 381
Reputation: 5126
If I understand correctly, you may not need to "filter twice" after the +=. See my example below:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, size=(4, 4)), columns=list('ABCD'))
    A   B   C   D
0  99  78  61  16
1  73   8  62  27
2  30  80   7  76
3  15  53  80  27
Now if you wanted to add the minimum of columns C and D to the current value of B, that would simply be:
df.loc[df['A'] < 80, 'B'] += np.minimum(df['C'], df['D'])
    A     B   C   D
0  99  78.0  61  16
1  73  35.0  62  27  # <--- meets condition: 8 + 27 = 35
2  30  87.0   7  76  # <--- meets condition: 80 + 7 = 87
3  15  80.0  80  27  # <--- meets condition: 53 + 27 = 80
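As a sanity check that does not depend on the random seed, here is a minimal sketch that rebuilds the table above by hand and compares the unfiltered right-hand side with a filtered one. The names `mask`, `unfiltered` and `filtered` are mine, not part of the example above.

```python
import numpy as np
import pandas as pd

# Same values as the table above, typed in by hand.
df = pd.DataFrame({
    "A": [99, 73, 30, 15],
    "B": [78, 8, 80, 53],
    "C": [61, 62, 7, 80],
    "D": [16, 27, 76, 27],
})

mask = df["A"] < 80  # boolean Series, True for rows 1, 2 and 3

# Right-hand side computed over every row; pandas aligns it on the index
# and only the rows selected on the left are written back.
unfiltered = df.copy()
unfiltered.loc[mask, "B"] += np.minimum(unfiltered["C"], unfiltered["D"])

# Right-hand side restricted to the same rows as the left-hand side.
filtered = df.copy()
filtered.loc[mask, "B"] += np.minimum(filtered.loc[mask, "C"], filtered.loc[mask, "D"])

print(unfiltered["B"].tolist())
print(filtered["B"].tolist())
```

Both variants should end up with the same values in B; whether the column keeps its integer dtype may differ between the two and between pandas versions.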
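Notice how, when A < 80, the B value increases by whichever value in C or D is smaller. One thing to note is that B turns into a float. I'm not sure why, but it is likely because the right-hand side covers every row: pandas aligns the two operands on their index, the row that fails the condition has no match on the left and becomes NaN, and NaN forces the intermediate result to float.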
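Applied to the line from the question, one option is to store the boolean mask in a variable. Python has no macro-like substitution, so each `df['A'] < a` written in the expanded line is evaluated separately; naming the mask means the comparison runs once. This is only a sketch: the sample data and the values of `a` and `b` are made up, and I have not measured how much time it saves on a large frame.

```python
import numpy as np
import pandas as pd

# Made-up sample data; the real df, a and b come from the question.
df = pd.DataFrame({
    "A": [1, 5, 2, 8],
    "B": [10, 20, 30, 40],
    "C": [0, 0, 0, 0],
})
a, b = 4, 15

# Evaluate the comparison once and reuse the resulting boolean Series.
mask = df["A"] < a

# Same update as in the question, with the mask reused on both sides.
df.loc[mask, "C"] += a * np.minimum(df.loc[mask, "B"], b)

print(df)
```

Here the right-hand side is restricted to the same rows as the left-hand side, so no NaN appears during index alignment. Using `df['B']` unfiltered on the right would give the same values in C, at the cost of computing the minimum for every row.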
Upvotes: 1