I have about 10 million lines of data in my dataframe. Below is an example of how it looks like for 2 rows. index Amount debit/credit 0 1000 1 1 2000 2 i want to write a function that checks if the value in the "debit/credit" column is a 1 for debit or a 2 for credit. And replaces the figure in the "Amount" column with a negative number if it is a 2. So for example, the table will change to this: index Amount debit/credit 0 1000 1 1 -2000 2 Here is the function i wrote but its really slow for 9million lines. Could anyone advise me how do i refactor this code? or is there a more efficient way to perform this task? (using python or sql. preferrably python.) def change_credits_to_negative(df): for num in range(len(df)): if df['debit/credit'].loc[num] == 2: # 1 is for debit & 2 is for credit df['Amount'].loc[num] = -df['Amount'].loc[num]

Reputation: 15

Pandas: How to write a faster loop which checks one column and then changes the signs of another column, depending on the value of the first column?

I have about 10 million lines of data in my dataframe. Below is an example of how it looks like for 2 rows.

index	Amount	debit/credit
0	1000	1
1	2000	2

i want to write a function that checks if the value in the "debit/credit" column is a 1 for debit or a 2 for credit. And replaces the figure in the "Amount" column with a negative number if it is a 2. So for example, the table will change to this:

index	Amount	debit/credit
0	1000	1
1	-2000	2

Here is the function i wrote but its really slow for 9million lines. Could anyone advise me how do i refactor this code? or is there a more efficient way to perform this task? (using python or sql. preferrably python.)

def change_credits_to_negative(df):
    for num in range(len(df)):
        if df['debit/credit'].loc[num] == 2: # 1 is for debit & 2 is for credit
            df['Amount'].loc[num] = -df['Amount'].loc[num]

Upvotes: 0

Answers (3)

SeaBean

Reputation: 23217

This is more a system performance (execution time) analysis rather than an answer, although it suggests slight improvement on the processing time of the 2 solutions:

Using - df['Amount'] is faster than df['Amount'] * -1 since +/- operations are faster than * / operations.

Hence, suggest to use:

df.loc[df['debit/credit'].eq(2), 'Amount'] = - df['Amount']

instead of: df.loc[df['debit/credit'].eq(2), 'Amount'] *= -1

For the np.where() solution, suggest to use:

df['Amount'] = np.where(df['debit/credit'].eq(2), - df['Amount'], df['Amount'])

instead of df['Amount'] = np.where(df['debit/credit'].eq(2), df['Amount']*-1, df['Amount'])

Benchmarking results:

df.loc solution:

%%timeit
df.loc[df['debit/credit'].eq(2), 'Amount'] *= -1

728 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.loc[df['debit/credit'].eq(2), 'Amount'] *= - df.loc[df['debit/credit'].eq(2), 'Amount']

1.08 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.loc[df['debit/credit'].eq(2), 'Amount'] = - df['Amount']

689 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The 2nd one with df.loc at both sides are the slowest probably because the .loc processing also takes time. The 3rd one is fastest.

np.where solution:

%%timeit
df['Amount'] = np.where(df['debit/credit'].eq(2), df['Amount']*-1, df['Amount'])

539 µs ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df['Amount'] = np.where(df['debit/credit'].eq(2), - df['Amount'], df['Amount'])

498 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

- df['Amount'] outperforms df['Amount'] * -1 in both solutions.

Upvotes: 3

Karn Kumar

Reputation: 8816

Other way around:

df.loc[df['debit/credit'].astype(str).str.contains('2'), 'Amount'] *= -1
print(df)
   Amount  debit/credit
0    1000             1
1   -2000             2

Upvotes: 1

Arkadiusz

Reputation: 1875

You could do it with .loc, but without a loop:

df.loc[df['debit/credit'].eq(2), 'Amount'] *= -1

Output:

    Amount  debit/credit
0     1000             1
1    -2000             2

via np.where():

import numpy as np

df['Amount'] = np.where(df['debit/credit'].eq(2), df['Amount']*-1, df['Amount'])

PERFORMANCE TESTS:

Let's create a sample dataframe with 2 columns and 10 million rows:

import time

df = pd.DataFrame({'Amount': np.random.randint(1000, 10000, size=10000000),
                   'debit/credit': np.random.randint(1, size=10000000) + 1})

1) LOOP:

start = time.perf_counter()

change_credits_to_negative(df)

stop = time.perf_counter()
print(stop - start)

97.34215749999998

2) LOC:

start = time.perf_counter()

df.loc[df['debit/credit'].eq(2), 'Amount'] *= -1

stop = time.perf_counter()
print(stop - start)

0.03006110000001172

It gives us 97 sec. with the loop and 0.03 sec. without it.

Upvotes: 6

Pandas: How to write a faster loop which checks one column and then changes the signs of another column, depending on the value of the first column?

Answers (3)

Benchmarking results:

Related Questions