Reputation: 367
I have a dataset as a Pandas DataFrame, and I am trying to get the k
largest values within several rolling windows of different sizes.
Simplified problem:
import pandas as pd
import numpy as np
np.random.seed(42)
def GenerateData(N=20, num_cols=2):
X = pd.DataFrame(np.random.rand(N, num_cols))
return X
X = GenerateData()
## >>> X.head()
## 0 1
## 0 0.971595 0.329454
## 1 0.187766 0.138250
## 2 0.573455 0.976918
## 3 0.207987 0.672529
## 4 0.271034 0.549839
My goal is then to get the k
largest values within each rolling window for each column. So if k_largest=3
and the rolling window sizes are windows=[4,7]
, we want the 3 largest values for windows of size 4 and 7. The way I am currently doing this is
def GetKLargestForWindow(windows=[4,7], k_largest=3, raw=False):
laggedVals = []
for L in windows:
for k in range(k_largest):
x_k_max = X.rolling(L).apply(lambda c: sorted(c, reverse=True)[k], raw=raw)
x_k_max = x_k_max.add_prefix( f'W{L}_{k+1}_' )
laggedVals.append( x_k_max )
laggedVals = pd.concat(laggedVals, axis=1).sort_index(axis=1)
return laggedVals
laggedVals = GetKLargestForWindow()
## >>> laggedVals.shape
## (20,12)
## >>> laggedVals.columns
## Index(['W4_1_0', 'W4_1_1', 'W4_2_0', 'W4_2_1', \
## 'W4_3_0', 'W4_3_1', 'W7_1_0','W7_1_1', \
## 'W7_2_0', 'W7_2_1', 'W7_3_0', 'W7_3_1'],dtype='object')
Note that there should be 12 columns total in this example. The column names there indicate W{window_size}_{j}_{col}
where j=1,2,3 corresponding to the 3 largest values of each window size for each column.
However my dataset is very large and I'm looking for a more efficient way to do this as the code takes very long to run. Any suggestions?
Benchmarks:
import timeit
## >>> timeit.timeit('GetKLargestForWindow()', globals=globals(), number=1000)
## 15.590040199999976
## >>> timeit.timeit('GetKLargestForWindow(raw=True)', globals=globals(), number=1000)
## 6.497314199999892
I have mostly solved this - huge speedup (especially in larger datasets when you increase N
, windows
, and k_largest
) by setting raw=True
in the apply-max function.
Upvotes: 1
Views: 676
Reputation: 93181
As always, if you want speed, use numpy as much as possible. Python loops are extremely slow compared to numpy vectorized code:
from numpy.lib.stride_tricks import sliding_window_view
def GetKLargestForWindow_CodeDifferent(windows=[4,7], k_largest=3):
n_row, n_col = X.shape
data = []
for w in windows:
# Create a rolling view of size w for each column in the dataframe
view = sliding_window_view(X, w, axis=0)
# Sort each view, reverse it (so largest first), and take the first
# k_largest elements
view = np.sort(view)[..., ::-1][..., :k_largest]
# Reshape the numpy array for easy conversion into a dataframe
view = np.reshape(view, (n_row - w + 1, -1))
# We know the first `w - 1` rows are all NaN since there are not enough
# data for the rolling operation
data.append(np.vstack([
np.zeros((w - 1, view.shape[1])) + np.nan,
view
]))
# `data` is shaped in this order
cols_1 = [f"W{w}_{k+1}_{col}" for w in windows for col in range(n_col) for k in range(k_largest)]
# But we want the columns in this order for easy comparison with the original code
cols_2 = [f"W{w}_{k+1}_{col}" for w in windows for k in range(k_largest) for col in range(n_col)]
return pd.DataFrame(np.hstack(data), columns=cols_1)[cols_2]
First, let's compare the result:
X = GenerateData(100_000, 2)
a = GetKLargestForWindow(raw=True)
b = GetKLargestForWindow_CodeDifferent()
assert a.compare(b).empty, "a and b are not the same"
Next, let's benchmark them:
%timeit GetKLargestForWindow(raw=True)
5.31 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit GetKLargestForWindow_CodeDifferent()
54.1 ms ± 761 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1
Reputation: 19
You can use the pandas built in function rolling (more info here). This takes in a dataframe and applies a roling window calculation based on a pandas built in, or own defined function (with apply). It only takes a single intiger as a window, or a window BaseIndexer subclass. I believe here you can specify multiple windows for multiple columns, but I find it easyer to loop over the colums.
X = pd.DataFrame([[((-1)**i) * i*10, ((-1)**i) * -i*5] for i in range(20)])
x = pd.DataFrame() #Emtpy dataframe, here roling window will be stored
windows = [4,7]
k = 3
for window, colname in zip(windows,X.columns):
x[colname] = X[colname].rolling(window).max()
print(x.nlargest(k,columns=x.columns)) #find max k values
result
19 180.0 95.0
18 180.0 85.0
17 160.0 85.0
16 160.0 75.0
0 NaN NaN
1 NaN NaN
2 NaN NaN
Upvotes: 0