beyarkay

Reputation: 1023

Pandas DataFrame iterate over a window of rows quickly

I've got a time-series dataframe that looks something like:

datetime                    gesture         left-5-x  ...30 columns omitted
2022-09-27 19:54:54.396680  gesture0255     533 
2022-09-27 19:54:54.403298  gesture0255     534 
2022-09-27 19:54:54.408938  gesture0255     535 
2022-09-27 19:54:54.413995  gesture0255     523 
2022-09-27 19:54:54.418666  gesture0255     522 
... 95 000 rows omitted

I want to create a new column df['cross_correlation'] that is a function of multiple sequential rows: the cross_correlation of row i depends on the data from rows i-10 to i+10.

I could do this with df.iterrows(), but that seems non-idiomatic. Is there a function like

df.window(-10, +10).apply(lambda rows: calculate_cross_correlation(rows))

or similar?

EDIT:

Thanks @chris, who pointed me towards df.rolling(), although I now have this example, which better reflects the problem I'm having:

Here's a simplified version of the function I want to apply over the moving window. The actual version requires the input to be the full 2D window of shape (window_size, num_columns). The toy function below doesn't strictly need a 2D input, so I've added an assertion to enforce that the input is 2D:

def sum_over_2d(x):
    assert len(x.shape) == 2, f'shape of input is {x.shape} and not of length 2'
    return x.sum()

If I now use .rolling with .apply:

df.rolling(window=10, center=True).apply(sum_over_2d)

I get an assertion error:

AssertionError: shape of input is (10,) and not of length 2

and if I print the input x before the assertion, I get:

0    533.0
1    534.0
2    535.0
3    523.0
4    522.0
5    526.0
6    510.0
7    509.0
8    502.0
9    496.0
dtype: float64

which is a single column from my many-columned dataset. What I want is for the input x to be a DataFrame or a 2D numpy array.

Upvotes: 0

Views: 543

Answers (1)

Chris

Reputation: 29732

IIUC, one way is to use pandas.Series.rolling.apply.

Example with sum:

df["new"] = df["left-5-x"].rolling(3, center=True, min_periods=1).sum()

Output:

                     datetime      gesture  left-5-x     new      explain
0  2022-09-27 19:54:54.396680  gesture0255       533  1067.0      533+534
1  2022-09-27 19:54:54.403298  gesture0255       534  1602.0  533+534+535
2  2022-09-27 19:54:54.408938  gesture0255       535  1592.0  534+535+523
3  2022-09-27 19:54:54.413995  gesture0255       523  1580.0  535+523+522
4  2022-09-27 19:54:54.418666  gesture0255       522  1045.0      523+522

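You can see that each left-5-x value is summed with its neighbors at offsets -1 and +1. At the two ends, where a neighbor is missing, min_periods=1 lets the sum use the shorter window instead of returning NaN.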
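A self-contained sketch of the same idea, using only the five left-5-x values shown in the question (the variable names are illustrative, not from the original post). It also passes the same windows through .apply with a stand-in custom function, to show where a hand-written per-window calculation would go:

```python
import pandas as pd

# The five left-5-x values from the question's sample rows.
s = pd.Series([533, 534, 535, 523, 522], name="left-5-x")

# center=True places row i in the middle of its window;
# min_periods=1 lets the edge rows use a shorter window instead of NaN.
rolled = s.rolling(3, center=True, min_periods=1).sum()
print(rolled.tolist())

# Same result via .apply with a custom function (a stand-in for a real
# per-window calculation). raw=True passes each window as a 1D numpy array.
custom = s.rolling(3, center=True, min_periods=1).apply(
    lambda w: w.sum(), raw=True
)
print(custom.tolist())
```

Note that the function given to .apply still only ever sees one column at a time, which is exactly the limitation described in the question's edit.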

Edit:

If you want to work with the rolled DataFrame itself, one way is to iterate over the rolling object, which yields each window as a DataFrame:

new_df = pd.concat([sum_over_2d(d) for d in df.rolling(window=10)], axis=1).T

Output:

     0    1    2    3
0    0    1    2    3
1    4    6    8   10
2   12   15   18   21
3   24   28   32   36
4   40   45   50   55
5   60   66   72   78
6   84   91   98  105
7  112  120  128  136
8  144  153  162  171
9  180  190  200  210
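The output above looks like it came from a small 10x4 frame holding the integers 0 to 39. That is an assumption, since the input is not shown. Under that assumption, a minimal runnable sketch of the iteration approach (toy and windows are hypothetical names) would be:

```python
import numpy as np
import pandas as pd


def sum_over_2d(x):
    # Same toy function as in the question: insist on a 2D input.
    assert len(x.shape) == 2, f"shape of input is {x.shape} and not of length 2"
    return x.sum()


# Assumed toy input: 10 rows x 4 columns containing 0..39.
toy = pd.DataFrame(np.arange(40).reshape(10, 4))

# Iterating a Rolling object yields one DataFrame per row. The early
# windows are shorter than 10 rows because there is not enough history yet.
windows = list(toy.rolling(window=10))

# Each call returns a Series of column sums; stack them as rows.
new_df = pd.concat([sum_over_2d(d) for d in windows], axis=1).T
print(new_df)
```

Because every window is a DataFrame, the 2D assertion passes. If only full-length windows are wanted, they can be filtered with a check such as len(d) == 10 before calling the function. This is still a Python-level loop, so it may be slow on ~95 000 rows.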

Or, as per @Sandwichnick's comment, you can use method="table", but only if you also pass engine="numba". In other words, your sum_over_2d must be numba-compilable (which is beyond the scope of this question and my knowledge):

df.rolling(window=10, center=True, method="table").sum(engine="numba")
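If numba is not an option, a different technique (not part of the approaches above) is to build the 2D windows directly with numpy's sliding_window_view. The following is only a sketch on assumed toy data: half, views and result are hypothetical names, and a real use would set half = 10 for the i-10 to i+10 window from the question.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def sum_over_2d(x):
    # Stand-in for the real per-window function; requires a 2D input.
    assert x.ndim == 2, f"shape of input is {x.shape} and not of length 2"
    return x.sum()


# Assumed toy input: 10 rows x 4 numeric columns.
toy = pd.DataFrame(np.arange(40, dtype=float).reshape(10, 4))

half = 1               # rows taken on each side of row i
window = 2 * half + 1  # full window length

# Sliding over axis 0 gives shape (n - window + 1, num_columns, window);
# transpose so each window has shape (window, num_columns).
views = sliding_window_view(toy.to_numpy(), window_shape=window, axis=0)
views = views.transpose(0, 2, 1)

# Align each result with the centre row of its window; rows without a
# full window on both sides stay NaN.
result = pd.Series(np.nan, index=toy.index)
result.iloc[half:len(toy) - half] = np.array([sum_over_2d(w) for w in views])
print(result)
```

The windows are views into the original array rather than copies, so building them is cheap; the per-window function call is still a Python loop unless the calculation can be vectorised over the first axis of views.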

Upvotes: 2
