Adam
Adam

Reputation: 367

Python Get k largest values of a Rolling Window for each column

I have a dataset as a Pandas DataFrame, and I am trying to get the k largest values within several rolling windows of different sizes.

Simplified problem:

import pandas as pd
import numpy as np
np.random.seed(42)

def GenerateData(N=20, num_cols=2):
    X = pd.DataFrame(np.random.rand(N, num_cols))
    return X
X = GenerateData()

## >>> X.head()
##    0         1
## 0  0.971595  0.329454
## 1  0.187766  0.138250
## 2  0.573455  0.976918
## 3  0.207987  0.672529
## 4  0.271034  0.549839

My goal is then to get the k largest values within each rolling window for each column. So if k_largest=3 and the rolling window sizes are windows=[4,7], we want the 3 largest values for windows of size 4 and 7. The way I am currently doing this is

def GetKLargestForWindow(windows=[4,7], k_largest=3, raw=False):
    laggedVals = []
    for L in windows:
        for k in range(k_largest):
            x_k_max = X.rolling(L).apply(lambda c: sorted(c, reverse=True)[k], raw=raw)
            x_k_max = x_k_max.add_prefix( f'W{L}_{k+1}_' )
            laggedVals.append( x_k_max )
    laggedVals = pd.concat(laggedVals, axis=1).sort_index(axis=1)
    return laggedVals
laggedVals = GetKLargestForWindow()

## >>> laggedVals.shape
## (20,12)

## >>> laggedVals.columns
## Index(['W4_1_0', 'W4_1_1', 'W4_2_0', 'W4_2_1', \
##  'W4_3_0', 'W4_3_1', 'W7_1_0','W7_1_1', \
##  'W7_2_0', 'W7_2_1', 'W7_3_0', 'W7_3_1'],dtype='object')

Note that there should be 12 columns total in this example. The column names there indicate W{window_size}_{j}_{col} where j=1,2,3 corresponding to the 3 largest values of each window size for each column.

However my dataset is very large and I'm looking for a more efficient way to do this as the code takes very long to run. Any suggestions?


Benchmarks:

import timeit
## >>> timeit.timeit('GetKLargestForWindow()', globals=globals(), number=1000)
## 15.590040199999976

## >>> timeit.timeit('GetKLargestForWindow(raw=True)', globals=globals(), number=1000)
## 6.497314199999892

Edit

I have mostly solved this - huge speedup (especially in larger datasets when you increase N, windows, and k_largest) by setting raw=True in the apply-max function.

Upvotes: 1

Views: 676

Answers (2)

Code Different
Code Different

Reputation: 93181

As always, if you want speed, use numpy as much as possible. Python loops are extremely slow compared to numpy vectorized code:

from numpy.lib.stride_tricks import sliding_window_view

def GetKLargestForWindow_CodeDifferent(windows=[4,7], k_largest=3):
    n_row, n_col = X.shape

    data = []
    for w in windows:
        # Create a rolling view of size w for each column in the dataframe
        view = sliding_window_view(X, w, axis=0)
        # Sort each view, reverse it (so largest first), and take the first
        # k_largest elements
        view = np.sort(view)[..., ::-1][..., :k_largest]
        # Reshape the numpy array for easy conversion into a dataframe
        view = np.reshape(view, (n_row - w + 1, -1))
        # We know the first `w - 1` rows are all NaN since there are not enough
        # data for the rolling operation
        data.append(np.vstack([
            np.zeros((w - 1, view.shape[1])) + np.nan,
            view
        ]))

    # `data` is shaped in this order
    cols_1 = [f"W{w}_{k+1}_{col}" for w in windows for col in range(n_col) for k in range(k_largest)]
    # But we want the columns in this order for easy comparison with the original code
    cols_2 = [f"W{w}_{k+1}_{col}" for w in windows for k in range(k_largest) for col in range(n_col)]
    
    return pd.DataFrame(np.hstack(data), columns=cols_1)[cols_2]

First, let's compare the result:

X = GenerateData(100_000, 2)
a = GetKLargestForWindow(raw=True)
b = GetKLargestForWindow_CodeDifferent()

assert a.compare(b).empty, "a and b are not the same"

Next, let's benchmark them:

%timeit GetKLargestForWindow(raw=True)
5.31 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit GetKLargestForWindow_CodeDifferent()
54.1 ms ± 761 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

Arend
Arend

Reputation: 19

You can use the pandas built in function rolling (more info here). This takes in a dataframe and applies a roling window calculation based on a pandas built in, or own defined function (with apply). It only takes a single intiger as a window, or a window BaseIndexer subclass. I believe here you can specify multiple windows for multiple columns, but I find it easyer to loop over the colums.

X = pd.DataFrame([[((-1)**i) * i*10, ((-1)**i) * -i*5] for i in range(20)])
x = pd.DataFrame() #Emtpy dataframe, here roling window will be stored
windows = [4,7]
k = 3
for window, colname in zip(windows,X.columns):
    x[colname] = X[colname].rolling(window).max()

print(x.nlargest(k,columns=x.columns)) #find max k values

result

19  180.0  95.0
18  180.0  85.0
17  160.0  85.0
16  160.0  75.0
0     NaN   NaN
1     NaN   NaN
2     NaN   NaN

Upvotes: 0

Related Questions