Jernej

Reputation: 352

Speeding up a pandas iteration testing a condition on subsequent elements

Given a pandas dataframe with three columns (C1, C2, C3) and a Series of positive numbers of equal length (coeff), I am computing a fourth column, C4, as follows

def event(data, coeff, rate_low=2, rate_high=2):

    bot_col_name = 'C4'

    # default: neither bound is hit before the end of the frame
    data[bot_col_name] = -1

    I = data.index
    for k in range(len(I) - 1):
        i = I[k]

        # both bounds are derived from the *next* row's C1 value
        next_val = data.at[I[k + 1], 'C1']
        c = coeff.at[i]

        low_bound = next_val - rate_low * c
        high_bound = next_val + rate_high * c

        # scan forward: whichever bound is hit first decides C4
        for j in range(k + 1, len(data)):
            if data.at[I[j], 'C2'] < low_bound:
                data.at[i, bot_col_name] = 0
                break

            if data.at[I[j], 'C3'] >= high_bound:
                data.at[i, bot_col_name] = 1
                break
    return data

In other words, given a row, we compute a lower and an upper bound from the next row's C1 value and then set that row's C4 depending on whether, scanning forward, we first drop below the lower bound on C2 or first reach the upper bound on C3.

As an example consider the pandas table D

   C1  C2  C3
0   2   5   5
1  10  12   2
2   8   3  17
3  30  25   3

now if coeff = [3,3,5,7], then when computing the value for the first row, the low_bound is 10-2*3=4 and the high_bound is 10+2*3=16. We now have to find the least index i>0 so that D.loc[i, 'C2'] < 4 or D.loc[i, 'C3'] >= 16. The first such i is 2, where in fact both conditions hold (C2 = 3 < 4 and C3 = 17 >= 16); since the C2 condition is tested first in the loop, we set the new column to 0 for this row.
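
To make the walk-through concrete, here is the function above applied to D; the expected C4 column in the comment is computed by hand from the same rules (rows 2 and 3 never hit either bound, so they keep the default -1):

import pandas as pd

D = pd.DataFrame({'C1': [2, 10, 8, 30],
                  'C2': [5, 12, 3, 25],
                  'C3': [5, 2, 17, 3]})
coeff = pd.Series([3, 3, 5, 7])

print(event(D, coeff))
#    C1  C2  C3  C4
# 0   2   5   5   0
# 1  10  12   2   1
# 2   8   3  17  -1
# 3  30  25   3  -1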

Unfortunately, the above solution is quite inefficient. I've tried optimizing it by computing the values backwards and caching the results (sometimes the value of C4 can be inferred from 'past' values), but unfortunately it's not significantly better.

In my experience, the best way to gain maximum performance is to try to express as much as possible within the pandas framework.

Is there any meaningful way that one could optimize the above code?

Edit: using the code of the accepted answer and substituting the following function gives the best results.

import numpy as np
from numba import njit

@njit
def get_c4(low_bound, high_bound, c2, c3):
    # positions (if any) at which each bound is first hit
    r1 = np.argwhere(c2 < low_bound)
    r2 = np.argwhere(c3 >= high_bound)

    if len(r1) == 0 and len(r2) == 0:
        return -1
    elif len(r1) == 0:
        return 1
    elif len(r2) == 0:
        return 0

    # 1 iff the high bound is hit strictly before the low bound
    return int(r1[0, 0] > r2[0, 0])
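
For context, a minimal sketch of the outer loop this function plugs into, following the same df.values column layout as the accepted answer below; the name event_argwhere is made up here, and df and coeff are assumed to be set up as in the answers:

@njit
def event_argwhere(data, coeff, rate_low=2, rate_high=2):
    # columns 0/1/2 of `data` are C1/C2/C3, as in the accepted answer
    out = -np.ones(len(data), dtype=np.int64)
    for k in range(len(data) - 1):
        next_val = data[k + 1, 0]
        c = coeff[k]
        out[k] = get_c4(next_val - rate_low * c,
                        next_val + rate_high * c,
                        data[k + 1:, 1],
                        data[k + 1:, 2])
    return out

df["C4"] = event_argwhere(df.values, coeff.values)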

Upvotes: 4

Views: 289

Answers (4)

villoro

Reputation: 1549

If you really need a fast solution you should use numba. An alternative to numba would be cython. Both compile your Python code to machine code to make it way faster, but in my opinion numba is simpler and they have more or less the same performance.

Compiled C / Fortran code is also what makes numpy / pandas internal functions so fast. More info in the pandas documentation on enhancing performance.

Let's first create the example:

import numpy as np
import pandas as pd
from numba import njit

df = pd.DataFrame({
    'C1': [2, 10, 8, 30],
    'C2': [5, 12, 3, 25],
    'C3': [5, 2, 17, 3]
})
coeff = pd.Series([3, 3, 5, 7])

Translating the code from the question to numba, we get:

@njit
def event_v(data, coeff, rate_low=2, rate_high=2):

    out = -np.ones(len(data), dtype=np.int8)

    for k in range(len(data) - 1):

        next_val = data[k + 1, 0]
        c = coeff[k]

        low_bound = next_val - rate_low * c
        high_bound = next_val + rate_high * c

        for j in range(k + 1, len(data)):
            if data[j, 1] < low_bound:
                out[k] = 0
                break

            if data[j, 2] >= high_bound:
                out[k] = 1
                break

    return out

df["C4"] = event_v(df.values, coeff.values)

TEST with 10 000 rows:

n = 10_000
df = pd.DataFrame(np.random.randint(30, size=[n, 3]), columns=["C1", "C2", "C3"])
coeff = pd.Series(np.random.randint(10, size=n))

%timeit event_v(df.values, coeff.values)
3.39 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit event(df, coeff) # Code from the question
28.4 s ± 1.02 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is roughly 8,400 times faster.

TEST with 1 000 000 rows:

n = 1_000_000
df = pd.DataFrame(np.random.randint(30, size=[n, 3]), columns=["C1", "C2", "C3"])
coeff = pd.Series(np.random.randint(10, size=n))
%timeit event_v(df.values, coeff.values)
27.6 s ± 1.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

I tried running the code from the question at this size, and after more than 2 hours the %timeit still had not finished.

Upvotes: 2

Araldo van de Kraats

Reputation: 304

next_val, low_bound and high_bound can easily be vectorized and their calculation is very fast. The second part is not easy to vectorize, as it potentially needs to scan the whole remaining array for each row. A slight improvement over your implementation (one that becomes more significant for large n) is obtained by doing the comparisons on numpy arrays.

def get_c4(low_bound, high_bound, c2, c3):
    # scan forward over the remaining rows; whichever bound is hit first wins
    for idx in range(len(c2)):
        if c2[idx] < low_bound:
            return 0
        if c3[idx] >= high_bound:
            return 1
    return -1

def event_new(data: pd.DataFrame, coeff, rate_low=2, rate_high=2):
    data['next_val'] = data['C1'].shift(periods=-1).ffill().astype('int')
    data['low_bound'] = (data['next_val'] - rate_low * coeff).astype('int')
    data['high_bound'] = (data['next_val'] + rate_high * coeff).astype('int')
    c2 = data['C2'].to_numpy()
    c3 = data['C3'].to_numpy()

    def row_c4(x):
        pos = data.index.get_loc(x.name)  # get_loc needs the row label, not the row itself
        return get_c4(x.low_bound, x.high_bound, c2[pos + 1:], c3[pos + 1:])

    data['C4'] = data.apply(row_c4, axis=1)
    # drop() returns a copy; assign it back so the helper columns actually go away
    data = data.drop(columns=['next_val', 'low_bound', 'high_bound'])
    return data

Benchmarking code:

import timeit
import numpy as np
import pandas as pd

# `random_list` is not defined in the answer; an assumed stand-in:
def random_list(start=0, stop=30, n=100):
    return np.random.randint(start, stop, size=n).tolist()

for n in [1e2, 1e3, 1e4, 1e5, 1e6]:
    n = int(n)
    df = pd.DataFrame({'C1': random_list(n=n), 'C2': random_list(n=n), 'C3': random_list(n=n)})
    coeff = pd.Series(random_list(start=2, stop=7, n=n))

    print(f"n={n}:")
    print(f"Time org: {timeit.timeit(lambda: event(df.copy(), coeff), number=1):.3f} seconds")
    print(f"Time new: {timeit.timeit(lambda: event_new(df.copy(), coeff), number=1):.3f} seconds")

Output:

n=100:
Time org: 0.007 seconds
Time new: 0.012 seconds
n=1000:
Time org: 0.070 seconds
Time new: 0.048 seconds
n=10000:
Time org: 0.854 seconds
Time new: 0.493 seconds
n=100000:
Time org: 7.565 seconds
Time new: 4.456 seconds
n=1000000:
Time org: 216.408 seconds
Time new: 45.199 seconds

Upvotes: 1

Brett Romero

Reputation: 353

Are you sure the algorithm is working exactly how you think it is? In particular, check the inner loop, where you look for the next row that has C2 < the low bound or C3 >= the high bound: what happens if both conditions are met by the same row? Anyway, assuming it is, the key to speeding up any algorithm in pandas is generally, as you have hinted at, to put everything into a DataFrame and then use column-based operations. I would suggest something like the following:

import numpy as np
import pandas as pd

# Setup DataFrame
df = pd.DataFrame({
    "C1": [2, 10, 8, 30],
    "C2": [5, 12, 3, 25],
    "C3": [5, 2, 17, 3]
})

coeffs = [3, 3, 5, 7]

df['coeff'] = coeffs
df['C1_next'] = df['C1'].shift(-1)

# Fixed Values
rate_low = 2
rate_high = 2

# Helper Functions
def test_bounds(row, lower, rate):
    # the lower bound subtracts the scaled coefficient, the upper bound adds it
    if lower:
        bound = row['C1_next'] - rate * row['coeff']
    else:
        bound = row['C1_next'] + rate * row['coeff']
    return bound


def earliest_bound(x, lower):
    # only rows strictly after the current one are candidates
    rows_to_search = df[df.index > x.name]

    if lower:
        hits = rows_to_search[rows_to_search['C2'] < x['lower_bound']]
    else:
        hits = rows_to_search[rows_to_search['C3'] >= x['higher_bound']]

    return hits.index[0] if len(hits) > 0 else np.nan


df['lower_bound'] = df.apply(test_bounds, args=(True, rate_low), axis=1)
df['higher_bound'] = df.apply(test_bounds, args=(False, rate_high), axis=1)

df["earliest_lower_bound_row"] = df.apply(earliest_bound, args=(True,), axis=1)
df["earliest_higher_bound_row"] = df.apply(earliest_bound, args=(False,), axis=1)

In this case, the values returned in earliest_lower_bound_row and earliest_higher_bound_row will be the index of the first row that met each condition (or NaN if no row did). Of course, if you do just want an actual 0/1 value, that is straightforward to create from those two columns, as sketched below.
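
For completeness, a minimal sketch of that last step, assuming the df built above; ties go to the C2 condition, matching the order of the checks in the question's loop:

lo = df['earliest_lower_bound_row'].fillna(np.inf)
hi = df['earliest_higher_bound_row'].fillna(np.inf)

# 0 if the low bound is hit first (ties included), 1 if the high bound
# is hit strictly first, -1 if neither bound is ever hit
df['C4'] = np.select(
    [np.isfinite(lo) & (lo <= hi), np.isfinite(hi) & (hi < lo)],
    [0, 1],
    default=-1,
)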

Upvotes: 1

Ramsha Siddiqui

Reputation: 480

You may find use for pandas' IntervalIndex, referenced below:

data.index = pd.IntervalIndex.from_arrays(data['C1'], data['C3'])
I = data.index

Your function might be rewritten as:

def event(data, coeff, rate_low=2, rate_high=2):

    bot_col_name = 'C4'
    data[bot_col_name] = -1

    for k, c in enumerate(coeff):
        i = I[k]

        # bounds relative to the left endpoint of the k-th interval
        low_bound = i.left - rate_low * c
        high_bound = i.left + rate_high * c

        for j in range(k + 1, len(data)):
            if data.at[I[j], 'C2'] < low_bound:
                data.at[i, bot_col_name] = 0
                break

            if data.at[I[j], 'C3'] >= high_bound:
                data.at[i, bot_col_name] = 1
                break
    return data

Let me know if this helps!

Upvotes: 0
