How can this for loop be written to process faster in Python?

Question

I'm not familiar enough with Python to understand how I can make a for loop go faster. Here's what I'm trying to do.

Let's say we have the following dataframe of prices.

import pandas as pd

df = pd.DataFrame.from_dict({'price': {0: 98, 1: 99, 2: 101, 3: 99, 4: 97, 5: 100, 6: 100, 7: 98}})

The goal is to create a new column called updown, which classifies each row as "up" or "down", signifying what comes first when looking at each subsequent row - up by 2, or down by 2.

df['updown'] = 0
for i in range(df.shape[0]):
    j=0
    while df.price.iloc[i+j] < (df.price.iloc[i] + 2) and df.price.iloc[i+j] > (df.price.iloc[i] - 2):
        j= j+1
    if df.price.iloc[i+j] >= (df.price.iloc[i] + 2):
        df.updown.iloc[i] = "Up"
    if df.price.iloc[i+j] <= (df.price.iloc[i] - 2):
        df.updown.iloc[i] = "Down"

This works just fine, but it runs far too slowly on millions of rows. Note that I am aware the code throws an error once it gets to the last row, which is fine with me.

Where can I learn how to make something like this run much faster (ideally in seconds, or at least minutes, as opposed to the 10+ hours it takes right now)?

John M. · Accepted Answer

After trying a number of different examples, I found that the second method in the following code is approximately 75x faster for the example dataset:

import pandas as pd, numpy as np
from random import randint
import time

data = [randint(90, 120) for i in range(10000)]

df1 = pd.DataFrame({'price': data})

t0 = time.time()
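# object dtype so the string labels can be assigned without an upcast (recent pandas warns or raises otherwise)
df1['updown'] = pd.Series(np.nan, index=df1.index, dtype=object)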
count = df1.shape[0]
for i in range(count):
    j = 1
    up = df1.price.iloc[i] + 2
    down = up - 4
    while (pos := i + j) < count:
        if(value := df1.price.iloc[pos]) >= up:
            df1.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df1.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1

t1 = time.time()

print(f'Method 1: {t1 - t0}')
res1 = df1.head()

df2 = pd.DataFrame({'price': data})
t2 = time.time()

count = len(df2)
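# object dtype so the string labels can be assigned without an upcast (recent pandas warns or raises otherwise)
df2['updown'] = pd.Series(np.nan, index=df2.index, dtype=object)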
up = df2.price + 2
down = df2.price - 2

# Increase the shift distance until updown is set for every row
# that still has later rows left to compare against.
i = 1
while i < count:
    isna = df2.updown.isna()
    # The last i - 1 rows have already run out of look-ahead data,
    # so stop once every other row has been classified.
    if i > 1 and not isna[:-(i - 1)].any():
        break
    shift = df2.price.shift(-i)
    df2.loc[isna & (shift >= up), 'updown'] = 'Up'
    df2.loc[isna & (shift <= down), 'updown'] = 'Down'
    i += 1

t3 = time.time()
print(f'Method 2: {t3 - t2}')

s1 = df1.updown
s2 = df2.updown

match = (s1.isnull() == s2.isnull()).all() and (s1[s1.notnull()] == s2[s2.notnull()]).all()

print(f'Series match: {match}')

The main reason for the speed improvement is that instead of iterating across the rows in Python, we operate on whole arrays of data, and those operations run in compiled C code. A single call from Python into pandas or NumPy (whose core routines are implemented in C) is quite quick, but each call carries some overhead, and if you make such calls many times, once or more per row, that overhead very quickly becomes the limiting factor.
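To make the difference concrete, here is a minimal sketch using the prices from the question. It checks only one step ahead ("is the next price up by at least 2?"), once with a Python-level loop and once with a single shifted comparison. The variable names are just for illustration.

```python
import pandas as pd

prices = pd.Series([98, 99, 101, 99, 97, 100, 100, 98])

# Row by row: two .iloc calls per row, each crossing from Python into pandas.
loop_result = [
    bool(prices.iloc[i + 1] >= prices.iloc[i] + 2)
    for i in range(len(prices) - 1)
]

# Vectorized: one shift and one comparison over the whole column.
# The last row has no successor (its shifted value is NaN), so it is dropped.
vec_result = (prices.shift(-1) >= prices + 2).iloc[:-1].tolist()

print(loop_result == vec_result)
```

Both give the same answer, but the second form makes a fixed number of pandas calls regardless of how many rows there are. Method 2 above repeats this idea with growing shift distances.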

The speedup depends on the input data, but it grows with the number of rows in the dataframe, because the row-by-row loop slows down much faster than the vectorized version as rows are added (times are in seconds; speedup is method1 / method2):

     rows  method1 (s)  method2 (s)      speedup
0     100     0.056002     0.018267     3.065689
1    1000     0.209895     0.005000    41.982070
2   10000     2.625701     0.009001   291.727054
3  100000   108.080149     0.042001  2573.260448