Reputation: 21
I'm not familiar enough with Python to understand how I can make a for loop go faster. Here's what I'm trying to do.
Let's say we have the following dataframe of prices.
import pandas as pd
df = pd.DataFrame.from_dict({'price': {0: 98, 1: 99, 2: 101, 3: 99, 4: 97, 5: 100, 6: 100, 7: 98}})
The goal is to create a new column called updown, which classifies each row as "Up" or "Down" according to which happens first in the rows that follow: the price rising by at least 2, or falling by at least 2.
df['updown'] = 0
for i in range(df.shape[0]):
    j = 0
    while df.price.iloc[i+j] < (df.price.iloc[i] + 2) and df.price.iloc[i+j] > (df.price.iloc[i] - 2):
        j = j + 1
    if df.price.iloc[i+j] >= (df.price.iloc[i] + 2):
        df.updown.iloc[i] = "Up"
    if df.price.iloc[i+j] <= (df.price.iloc[i] - 2):
        df.updown.iloc[i] = "Down"
This works, but it runs far too slowly on millions of rows. Note that I am aware the code throws an error once it gets to the last row, which is fine with me.
Where can I learn how to make something like this run much faster (ideally in seconds, or at least minutes, as opposed to the 10+ hours it takes right now)?
Upvotes: 1
Views: 151
Reputation: 795
After running through a number of different examples, I found the second method in the following code to be approximately 75x faster on the example dataset:
import pandas as pd, numpy as np
from random import randint
import time
data = [randint(90, 120) for i in range(10000)]
# Method 1: row-by-row loop in Python
df1 = pd.DataFrame({'price': data})
t0 = time.time()
# object dtype so string labels can be assigned without a dtype-mismatch warning in recent pandas
df1['updown'] = pd.Series(np.nan, index=df1.index, dtype=object)
count = df1.shape[0]
for i in range(count):
    j = 1
    up = df1.price.iloc[i] + 2
    down = up - 4
    # scan forward until a later price reaches either threshold
    while (pos := i + j) < count:
        if (value := df1.price.iloc[pos]) >= up:
            df1.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df1.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1
t1 = time.time()
print(f'Method 1: {t1 - t0}')
res1 = df1.head()
# Method 2: compare whole columns against shifted copies of the price column
df2 = pd.DataFrame({'price': data})
t2 = time.time()
count = len(df2)
# object dtype so string labels can be assigned without a dtype-mismatch warning in recent pandas
df2['updown'] = pd.Series(np.nan, index=df2.index, dtype=object)
up = df2.price + 2
down = df2.price - 2
# increase shift range until updown is set for all rows
# or there is insufficient data to change remaining rows
i = 1
while (i < count) and (not (isna := df2.updown.isna()) is None and ((i == 1) or (isna[:-(i - 1)].any()))):
    shift = df2.price.shift(-i)  # price i rows later, NaN past the end
    df2.loc[isna & (shift >= up), 'updown'] = 'Up'
    df2.loc[isna & (shift <= down), 'updown'] = 'Down'
    i += 1
t3 = time.time()
print(f'Method 2: {t3 - t2}')
# check that both methods label every row identically
s1 = df1.updown
s2 = df2.updown
match = (s1.isnull() == s2.isnull()).all() and (s1[s1.notnull()] == s2[s2.notnull()]).all()
print(f'Series match: {match}')
The main reason for the speed improvement is that, instead of iterating across the rows in Python, we are doing operations on whole arrays of data, and that work happens in compiled code. While Python calling into pandas or NumPy (whose core routines are implemented in C) is quite quick, each call carries some overhead, and if you are doing this many times it very quickly becomes the limiting factor.
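To make the shifting idea concrete on the eight prices from the question, here is a minimal sketch that wraps the same approach in a function. The name classify_updown and its step parameter are my own, hypothetical choices, and the stopping test is a slightly simplified variant of the one in Method 2, so treat it as an illustration rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd


def classify_updown(price: pd.Series, step: int = 2) -> pd.Series:
    """Label each row 'Up' or 'Down' by whichever threshold a later row reaches first.

    Hypothetical helper: rows that never reach either threshold stay NaN.
    """
    result = pd.Series(np.nan, index=price.index, dtype=object)
    up = price + step
    down = price - step
    n = len(price)
    for offset in range(1, n):
        pending = result.isna()
        # Only the first n - offset rows have a price `offset` rows ahead;
        # if none of those is still unlabelled, larger offsets cannot help.
        if not pending.iloc[:n - offset].any():
            break
        ahead = price.shift(-offset)  # price `offset` rows later, NaN past the end
        result[pending & (ahead >= up)] = "Up"
        result[pending & (ahead <= down)] = "Down"
    return result


prices = pd.Series([98, 99, 101, 99, 97, 100, 100, 98])
labels = classify_updown(prices)
print(labels.tolist())
# ['Up', 'Up', 'Down', 'Down', 'Up', 'Down', 'Down', nan]
```

The Python-level loop here runs once per offset, not once per row, so on this data it needs only two passes over the column before every resolvable row is labelled.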
The performance increase depends on the input data, but it grows with the number of rows in the dataframe, because the more rows there are, the slower it is to iterate (times are in seconds):
   iterations     method1   method2     increase
0         100    0.056002  0.018267     3.065689
1        1000    0.209895  0.005000    41.982070
2       10000    2.625701  0.009001   291.727054
3      100000  108.080149  0.042001  2573.260448
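If you want to produce a table like this for your own data, something along the following lines should work. This is only a sketch: compare_timings is a hypothetical helper, not the script that generated the numbers above, and the two lambdas are simple stand-ins (a Python-level loop versus a single array operation), not the two methods themselves.

```python
import time

import numpy as np
import pandas as pd


def compare_timings(slow, fast, make_input, sizes):
    """Time two callables on inputs of growing size and tabulate the ratio.

    Hypothetical helper: `slow` and `fast` each take the object returned by
    `make_input(n)`; all times are wall-clock seconds.
    """
    rows = []
    for n in sizes:
        data = make_input(n)
        t0 = time.perf_counter()
        slow(data)
        t1 = time.perf_counter()
        fast(data)
        t2 = time.perf_counter()
        rows.append({"rows": n, "slow_s": t1 - t0, "fast_s": t2 - t1})
    table = pd.DataFrame(rows)
    table["speedup"] = table["slow_s"] / table["fast_s"]
    return table


table = compare_timings(
    slow=lambda a: sum(x + 2 for x in a),  # stand-in: one Python step per element
    fast=lambda a: (a + 2).sum(),          # stand-in: one call on the whole array
    make_input=lambda n: np.random.randint(90, 121, size=n),
    sizes=[100, 1000, 10000],
)
print(table)
```

To compare the two methods from this answer, wrap each one in a function that takes the price data and pass those in place of the lambdas. Single runs are noisy at small sizes, so repeating each measurement a few times gives steadier numbers.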
Upvotes: 3
Reputation: 795
There are various errors stopping the example code from working, at least for me. Could you please confirm this is what you want the algorithm to do?
import pandas as pd
df = pd.DataFrame.from_dict({'price': {0: 98, 1: 99, 2: 101, 3: 99, 4: 97, 5: 100, 6: 100, 7: 98}})
df['updown'] = 0
count = df.shape[0]
for i in range(count):
    j = 1
    up = df.price.iloc[i] + 2
    down = up - 4
    while (pos := i + j) < count:
        if (value := df.price.iloc[pos]) >= up:
            df.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1
print(df)
Upvotes: 0