Reputation: 15
I have about 10 million lines of data in my dataframe. Below is an example of how it looks like for 2 rows.
index | Amount | debit/credit |
---|---|---|
0 | 1000 | 1 |
1 | 2000 | 2 |
i want to write a function that checks if the value in the "debit/credit" column is a 1 for debit or a 2 for credit. And replaces the figure in the "Amount" column with a negative number if it is a 2. So for example, the table will change to this:
index | Amount | debit/credit |
---|---|---|
0 | 1000 | 1 |
1 | -2000 | 2 |
Here is the function i wrote but its really slow for 9million lines. Could anyone advise me how do i refactor this code? or is there a more efficient way to perform this task? (using python or sql. preferrably python.)
def change_credits_to_negative(df):
for num in range(len(df)):
if df['debit/credit'].loc[num] == 2: # 1 is for debit & 2 is for credit
df['Amount'].loc[num] = -df['Amount'].loc[num]
Upvotes: 0
Views: 100
Reputation: 23217
This is more a system performance (execution time) analysis rather than an answer, although it suggests slight improvement on the processing time of the 2 solutions:
Using - df['Amount']
is faster than df['Amount'] * -1
since +/-
operations are faster than * /
operations.
Hence, suggest to use:
df.loc[df['debit/credit'].eq(2), 'Amount'] = - df['Amount']
instead of: df.loc[df['debit/credit'].eq(2), 'Amount'] *= -1
For the np.where()
solution, suggest to use:
df['Amount'] = np.where(df['debit/credit'].eq(2), - df['Amount'], df['Amount'])
instead of df['Amount'] = np.where(df['debit/credit'].eq(2), df['Amount']*-1, df['Amount'])
df.loc
solution:%%timeit
df.loc[df['debit/credit'].eq(2), 'Amount'] *= -1
728 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.loc[df['debit/credit'].eq(2), 'Amount'] *= - df.loc[df['debit/credit'].eq(2), 'Amount']
1.08 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.loc[df['debit/credit'].eq(2), 'Amount'] = - df['Amount']
689 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The 2nd one with df.loc
at both sides are the slowest probably because the .loc
processing also takes time. The 3rd one is fastest.
np.where
solution:%%timeit
df['Amount'] = np.where(df['debit/credit'].eq(2), df['Amount']*-1, df['Amount'])
539 µs ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df['Amount'] = np.where(df['debit/credit'].eq(2), - df['Amount'], df['Amount'])
498 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- df['Amount']
outperforms df['Amount'] * -1
in both solutions.
Upvotes: 3
Reputation: 8816
Other way around:
df.loc[df['debit/credit'].astype(str).str.contains('2'), 'Amount'] *= -1
print(df)
Amount debit/credit
0 1000 1
1 -2000 2
Upvotes: 1
Reputation: 1875
You could do it with .loc
, but without a loop:
df.loc[df['debit/credit'].eq(2), 'Amount'] *= -1
Output:
Amount debit/credit
0 1000 1
1 -2000 2
OR
via np.where()
:
import numpy as np
df['Amount'] = np.where(df['debit/credit'].eq(2), df['Amount']*-1, df['Amount'])
PERFORMANCE TESTS:
Let's create a sample dataframe with 2 columns and 10 million rows:
import time
df = pd.DataFrame({'Amount': np.random.randint(1000, 10000, size=10000000),
'debit/credit': np.random.randint(1, size=10000000) + 1})
1) LOOP:
start = time.perf_counter()
change_credits_to_negative(df)
stop = time.perf_counter()
print(stop - start)
97.34215749999998
2) LOC:
start = time.perf_counter()
df.loc[df['debit/credit'].eq(2), 'Amount'] *= -1
stop = time.perf_counter()
print(stop - start)
0.03006110000001172
It gives us 97 sec. with the loop and 0.03 sec. without it.
Upvotes: 6