Billiam
Billiam

Reputation: 183

Subtracting a rolling window mean based on value from one column based on another without loops in Pandas

I'm not sure what the word is for what I'm doing, but I can't just use the pandas rolling (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html) function because the window is not a fixed size in terms of database indices. What I'm trying to do this:

I have a dataframe with columns UT (time in hours, but not a datetime object) and WINDS, I want to add a third column that subtracts the mean of all WINDS values that are within 12 hours of the time in the UT column. Currently, I do it like this:

rolsub = []

for i in df['UT']:
    df1 = df[ (df['UT'] > (i-12)) & (df['UT'] < (i+12)) ]
    df2 = df[df['UT'] == i]
    rolsub +=  [float(df2['WINDS'] - df1['WINDS'].mean())]

df['WIND_SUB'] = rolsub

This works fine, but it takes way too long since my dataframe has tens of thousands of entries. There must be a better way to do this, right? Please help!

Upvotes: 0

Views: 431

Answers (1)

user2246849
user2246849

Reputation: 4407

If I understood correctly, you could create a fake DatetimeIndex to use for rolling.

Example data:

import pandas as pd

df = pd.DataFrame({'UT':[0.5, 1, 2, 8, 9, 12, 13, 14, 15, 24, 60, 61, 63, 100],
                   'WINDS':[1, 1, 10, 1, 1, 1, 5, 5, 5, 5, 5, 1, 1, 10]})

print(df)
       UT  WINDS
0     0.5      1
1     1.0      1
2     2.0     10
3     8.0      1
4     9.0      1
5    12.0      1
6    13.0      5
7    14.0      5
8    15.0      5
9    24.0      5
10   60.0      5
11   61.0      1
12   63.0      1
13  100.0     10

Code:

# Fake DatetimeIndex.
df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
df = df.set_index('dt')

df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()

print(df)

Which gives:

                        UT  WINDS  WINDS_SUB
dt                                          
2022-05-11 00:30:00    0.5      1  -1.500000
2022-05-11 01:00:00    1.0      1  -1.500000
2022-05-11 02:00:00    2.0     10   7.142857
2022-05-11 08:00:00    8.0      1  -2.333333
2022-05-11 09:00:00    9.0      1  -2.333333
2022-05-11 12:00:00   12.0      1  -2.333333
2022-05-11 13:00:00   13.0      5   0.875000
2022-05-11 14:00:00   14.0      5   1.714286
2022-05-11 15:00:00   15.0      5   1.714286
2022-05-12 00:00:00   24.0      5   0.000000
2022-05-13 12:00:00   60.0      5   2.666667
2022-05-13 13:00:00   61.0      1  -1.333333
2022-05-13 15:00:00   63.0      1  -1.333333
2022-05-15 04:00:00  100.0     10   0.000000

The result on this small test set matches the output of your code. This assumes UT is representing hours from a certain start timepoint, which seems to be the case by looking at your solution.

Runtime:

I tested it on the following df with 30,000 rows:

import numpy as np

df = pd.DataFrame({'UT':range(30000),
                   'WINDS':np.full(30000, 1)})

def loop(df):
    rolsub = []

    for i in df['UT']:
        df1 = df[ (df['UT'] > (i-12)) & (df['UT'] < (i+12)) ]
        df2 = df[df['UT'] == i]
        rolsub +=  [float(df2['WINDS'] - df1['WINDS'].mean())]

    df['WIND_SUB'] = rolsub

def vector(df):
    df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
    df = df.set_index('dt')

    df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()

    return df

# 10.1 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit loop(df)

# 1.69 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vector(df)

So it's about 5,000 times faster.

Upvotes: 1

Related Questions