Reputation: 352
Given a pandas dataframe with three columns (C1, C2, C3) and a Series of positive numbers of the same length (coeff), I am computing a fourth column, C4, as follows:
def event(data, coeff, rate_low=2, rate_high=2):
    bot_col_name = 'C4'
    data[bot_col_name] = -1
    I = data.index
    for k in range(len(I) - 1):
        i = I[k]
        next_val = data.at[I[k + 1], 'C1']
        c = coeff.at[i]
        low_bound = next_val - rate_low * c
        high_bound = next_val + rate_high * c
        for j in range(k + 1, len(data)):
            if data.at[I[j], 'C2'] < low_bound:
                data.at[i, bot_col_name] = 0
                break
            if data.at[I[j], 'C3'] >= high_bound:
                data.at[i, bot_col_name] = 1
                break
    return data
In other words, given a row, we compute a lower and an upper bound from the next row's C1 and then set that row's element depending on whether, scanning the following rows, C2 first drops below the lower bound or C3 first reaches the upper bound.
As an example, consider the pandas table D:
   C1  C2  C3
0   2   5   5
1  10  12   2
2   8   3  17
3  30  25   3
Now if coeff = [3, 3, 5, 7], then when computing the value for the first row the low_bound is 10 - 2*3 = 4 and the high_bound is 10 + 2*3 = 16. We now have to find the least index i > 0 such that D.loc[i, 'C2'] < 4 or D.loc[i, 'C3'] >= 16. The first such i is 2 (row 1 satisfies neither condition); row 2 in fact satisfies both conditions at once, and since the code checks the C2 condition first, we'd set the new column to 0 for this row.
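Tracing the remaining rows the same way: row 1 gets bounds 8 - 2*3 = 2 and 8 + 2*3 = 14 and first reaches the C3 condition at row 2, while row 2 (bounds 20 and 40) never hits either bound, so the full result should be:

event(D, pd.Series([3, 3, 5, 7]))
#    C1  C2  C3  C4
# 0   2   5   5   0
# 1  10  12   2   1
# 2   8   3  17  -1
# 3  30  25   3  -1

(The last row keeps the initial -1, since the outer loop stops at the second-to-last row.)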
Unfortunately, the above solution is quite inefficient. I've tried optimizing it by computing the values backwards and caching the results (sometimes one can infer the value of C4 from 'past' values), but unfortunately it's not significantly better.
In my experience, the best way to gain maximum performance is to try to express as much as possible within the pandas framework.
Is there any meaningful way that one could optimize the above code?
Edit: Using the code from the accepted answer and substituting the following function gives the best results.
@njit
def get_c4(low_bound, high_bound, c2, c3):
    # Positions (relative to the slice) where each condition holds.
    r1 = np.argwhere(c2 < low_bound)
    r2 = np.argwhere(c3 >= high_bound)
    if len(r1) == 0 and len(r2) == 0:
        return -1  # neither bound is ever hit
    elif len(r1) == 0:
        return 1   # only the high bound is hit
    elif len(r2) == 0:
        return 0   # only the low bound is hit
    return int(r1[0] > r2[0])  # 0 if the low bound comes first (ties included), else 1
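The edit doesn't show the surrounding loop, so here is a minimal sketch of how get_c4 might be wired up (the name event_argwhere and the driver itself are my assumptions; the bounds are computed per row exactly as in the original function):

def event_argwhere(data, coeff, rate_low=2, rate_high=2):
    # Pull the columns out of pandas once; all per-row scans then run on numpy arrays.
    c1 = data['C1'].to_numpy()
    c2 = data['C2'].to_numpy()
    c3 = data['C3'].to_numpy()
    cf = coeff.to_numpy()
    out = np.full(len(data), -1, dtype=np.int64)
    for k in range(len(data) - 1):
        low_bound = c1[k + 1] - rate_low * cf[k]
        high_bound = c1[k + 1] + rate_high * cf[k]
        out[k] = get_c4(low_bound, high_bound, c2[k + 1:], c3[k + 1:])
    data['C4'] = out
    return data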
Upvotes: 4
Views: 289
Reputation: 1549
If you really need a fast solution you should use numba. An alternative to numba would be cython. Both compile your Python code (Cython to C, Numba to machine code via LLVM) to make it way faster, but in my opinion numba is simpler and they have more or less the same performance.
Compiling the code to C/Fortran is what makes numpy/pandas internal functions so fast. More info in the pandas documentation.
Let's first create the example:
import numpy as np
import pandas as pd
from numba import njit

df = pd.DataFrame({
    'C1': [2, 10, 8, 30],
    'C2': [5, 12, 3, 25],
    'C3': [5, 2, 17, 3]
})
coeff = pd.Series([3, 3, 5, 7])
And then, transforming the code from the question to numba, we get:
@njit
def event_v(data, coeff, rate_low=2, rate_high=2):
    # data is the raw numpy array; columns 0, 1, 2 hold C1, C2, C3.
    out = -np.ones(len(data), dtype=np.int8)
    for k in range(len(data) - 1):
        next_val = data[k + 1, 0]
        c = coeff[k]
        low_bound = next_val - rate_low * c
        high_bound = next_val + rate_high * c
        for j in range(k + 1, len(data)):
            if data[j, 1] < low_bound:
                out[k] = 0
                break
            if data[j, 2] >= high_bound:
                out[k] = 1
                break
    return out
df["C4"] = event_v(df.values, coeff.values)
n = 10_000
df = pd.DataFrame(np.random.randint(30, size=[n, 3]), columns=["C1", "C2", "C3"])
coeff = pd.Series(np.random.randint(10, size=n))
%timeit event_v(df.values, coeff.values)
3.39 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit event(df, coeff) # Code from the question
28.4 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
That is around 8,400 times faster (28.4 s / 3.39 ms).
n = 1_000_000
df = pd.DataFrame(np.random.randint(30, size=[n, 3]), columns=["C1", "C2", "C3"])
coeff = pd.Series(np.random.randint(10, size=n))
%timeit event_v(df.values, coeff.values)
27.6 s ± 1.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
I tried running that with the code from the question and after more than 2 hours the %timeit did not finish.
Upvotes: 2
Reputation: 304
next_val, low_bound and high_bound can easily be vectorized, and their calculation is very fast. The second part is not easy to vectorize, since it potentially needs to scan the whole array for each row. A modest improvement over your implementation, which becomes more significant for large n, is obtained by doing the comparisons on numpy arrays.
def get_c4(low_bound, high_bound, c2, c3):
    # Scan forward until one of the bounds is crossed.
    for idx in range(len(c2)):
        if c2[idx] < low_bound:
            return 0
        if c3[idx] >= high_bound:
            return 1
    return -1
def event_new(data: pd.DataFrame, coeff, rate_low=2, rate_high=2):
    # Vectorized part: next value and both bounds for every row at once.
    data['next_val'] = data['C1'].shift(periods=-1).ffill().astype('int')
    data['low_bound'] = (data['next_val'] - rate_low * coeff).astype('int')
    data['high_bound'] = (data['next_val'] + rate_high * coeff).astype('int')
    c2 = data['C2'].to_numpy()
    c3 = data['C3'].to_numpy()
    # Sequential part: scan the rows after each position on plain numpy arrays.
    data['C4'] = data.apply(
        lambda x: get_c4(
            x.low_bound,
            x.high_bound,
            c2[data.index.get_loc(x.name) + 1:],
            c3[data.index.get_loc(x.name) + 1:],
        ),
        axis=1,
    )
    data = data.drop(columns=['next_val', 'low_bound', 'high_bound'])
    return data
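As a quick sanity check, on the question's four-row example this should agree with a hand trace of the original function:

df = pd.DataFrame({'C1': [2, 10, 8, 30], 'C2': [5, 12, 3, 25], 'C3': [5, 2, 17, 3]})
coeff = pd.Series([3, 3, 5, 7])
print(event_new(df, coeff)['C4'].tolist())  # expected: [0, 1, -1, -1]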
Benchmarking code:
import timeit

for n in [1e2, 1e3, 1e4, 1e5, 1e6]:
    n = int(n)
    df = pd.DataFrame({'C1': random_list(n=n), 'C2': random_list(n=n), 'C3': random_list(n=n)})
    coeff = pd.Series(random_list(start=2, stop=7, n=n))
    print(f"n={n}:")
    print(f"Time org: {timeit.timeit(lambda: event(df.copy(), coeff), number=1):.3f} seconds")
    print(f"Time new: {timeit.timeit(lambda: event_new(df.copy(), coeff), number=1):.3f} seconds")
Output:
n=100:
Time org: 0.007 seconds
Time new: 0.012 seconds
n=1000:
Time org: 0.070 seconds
Time new: 0.048 seconds
n=10000:
Time org: 0.854 seconds
Time new: 0.493 seconds
n=100000:
Time org: 7.565 seconds
Time new: 4.456 seconds
n=1000000:
Time org: 216.408 seconds
Time new: 45.199 seconds
Upvotes: 1
Reputation: 353
Are you sure the algorithm is working exactly how you think it is? Look particularly at the inner loop, where you search for the next row with C2 < the low bound or C3 >= the high bound (e.g. what happens if both conditions are met by the same row?). Anyway, assuming it is, the key to speeding up any algorithm in pandas, as you have hinted at, is generally to put everything into a DataFrame and then use column-based operations. I would suggest something like the following:
import numpy as np
import pandas as pd

# Setup DataFrame
df = pd.DataFrame({
    "C1": [2, 10, 8, 30],
    "C2": [5, 12, 3, 25],
    "C3": [5, 2, 17, 3]
})
coeffs = [3, 3, 5, 7]
df['coeff'] = coeffs
df['C1_next'] = df['C1'].shift(-1)

# Fixed Values
rate_low = 2
rate_high = 2
# Helper Functions
def test_bounds(row, lower, rate):
    # Compute the lower or upper bound for a single row.
    if lower:
        bound = row['C1_next'] - rate * row['coeff']
    else:
        bound = row['C1_next'] + rate * row['coeff']
    return bound

def earliest_bound(x, lower):
    # Index of the first later row that crosses the given bound, or NaN.
    rows_to_search = df[df.index > x.name]
    if lower:
        hits = rows_to_search[rows_to_search['C2'] < x['lower_bound']]
    else:
        hits = rows_to_search[rows_to_search['C3'] >= x['higher_bound']]
    if len(hits) > 0:
        value = hits.index[0]
    else:
        value = np.nan
    return value
df['lower_bound'] = df.apply(test_bounds, args=(True, rate_low), axis=1)
df['higher_bound'] = df.apply(test_bounds, args=(False, rate_high), axis=1)
df["earliest_lower_bound_row"] = df.apply(earliest_bound, args=(True, ), axis=1)
df["earliest_higher_bound_row"] = df.apply(earliest_bound, args=(False, ), axis=1)
In this case the values returned in earliest_lower_bound_row and earliest_higher_bound_row will be the first row that met each condition. Of course, if you just want an actual 0/1 value, that is straightforward to create using the information in those two columns.
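For instance, a sketch of that last step (my own; ties go to the lower-bound rule, matching the question's code, which checks C2 first):

lb = df['earliest_lower_bound_row']
hb = df['earliest_higher_bound_row']
df['C4'] = np.select(
    [lb.notna() & (hb.isna() | (lb <= hb)),  # low bound hit first (or tied)
     hb.notna()],                            # high bound hit first
    [0, 1],
    default=-1,
)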
Upvotes: 1
Reputation: 480
You may find a use for the pandas IntervalIndex, shown below:
data.index = pd.IntervalIndex.from_arrays(data['C1'], data['C3'])
I = data.index
Your function might then be rewritten as:
def event(data, coeff, rate_low=2, rate_high=2):
    bot_col_name = 'C4'
    data[bot_col_name] = -1
    for k, c in enumerate(coeff):
        i = I[k]
        low_bound = i.left - rate_low * c
        high_bound = i.right + rate_high * c
        for j in range(k + 1, len(data)):
            if data.at[I[j], 'C2'] < low_bound:
                data.at[i, bot_col_name] = 0
                break
            if data.at[I[j], 'C3'] >= high_bound:
                data.at[i, bot_col_name] = 1
                break
    return data
Let me know if this helps!
Upvotes: 0