Reputation: 183
I'm not sure what the word is for what I'm doing, but I can't just use the pandas rolling (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html) function because the window is not a fixed size in terms of database indices. What I'm trying to do this:
I have a dataframe with columns UT (time in hours, but not a datetime object) and WINDS, I want to add a third column that subtracts the mean of all WINDS values that are within 12 hours of the time in the UT column. Currently, I do it like this:
rolsub = []
for i in df['UT']:
df1 = df[ (df['UT'] > (i-12)) & (df['UT'] < (i+12)) ]
df2 = df[df['UT'] == i]
rolsub += [float(df2['WINDS'] - df1['WINDS'].mean())]
df['WIND_SUB'] = rolsub
This works fine, but it takes way too long since my dataframe has tens of thousands of entries. There must be a better way to do this, right? Please help!
Upvotes: 0
Views: 431
Reputation: 4407
If I understood correctly, you could create a fake DatetimeIndex
to use for rolling.
import pandas as pd
df = pd.DataFrame({'UT':[0.5, 1, 2, 8, 9, 12, 13, 14, 15, 24, 60, 61, 63, 100],
'WINDS':[1, 1, 10, 1, 1, 1, 5, 5, 5, 5, 5, 1, 1, 10]})
print(df)
UT WINDS
0 0.5 1
1 1.0 1
2 2.0 10
3 8.0 1
4 9.0 1
5 12.0 1
6 13.0 5
7 14.0 5
8 15.0 5
9 24.0 5
10 60.0 5
11 61.0 1
12 63.0 1
13 100.0 10
# Fake DatetimeIndex.
df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
df = df.set_index('dt')
df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
print(df)
Which gives:
UT WINDS WINDS_SUB
dt
2022-05-11 00:30:00 0.5 1 -1.500000
2022-05-11 01:00:00 1.0 1 -1.500000
2022-05-11 02:00:00 2.0 10 7.142857
2022-05-11 08:00:00 8.0 1 -2.333333
2022-05-11 09:00:00 9.0 1 -2.333333
2022-05-11 12:00:00 12.0 1 -2.333333
2022-05-11 13:00:00 13.0 5 0.875000
2022-05-11 14:00:00 14.0 5 1.714286
2022-05-11 15:00:00 15.0 5 1.714286
2022-05-12 00:00:00 24.0 5 0.000000
2022-05-13 12:00:00 60.0 5 2.666667
2022-05-13 13:00:00 61.0 1 -1.333333
2022-05-13 15:00:00 63.0 1 -1.333333
2022-05-15 04:00:00 100.0 10 0.000000
The result on this small test set matches the output of your code. This assumes UT
is representing hours from a certain start timepoint, which seems to be the case by looking at your solution.
I tested it on the following df
with 30,000 rows:
import numpy as np
df = pd.DataFrame({'UT':range(30000),
'WINDS':np.full(30000, 1)})
def loop(df):
rolsub = []
for i in df['UT']:
df1 = df[ (df['UT'] > (i-12)) & (df['UT'] < (i+12)) ]
df2 = df[df['UT'] == i]
rolsub += [float(df2['WINDS'] - df1['WINDS'].mean())]
df['WIND_SUB'] = rolsub
def vector(df):
df['dt'] = pd.to_datetime('today').normalize() + pd.to_timedelta(df['UT'], unit='h')
df = df.set_index('dt')
df['WINDS_SUB'] = df['WINDS'] - df['WINDS'].rolling('24h', center=True, closed='neither').mean()
return df
# 10.1 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit loop(df)
# 1.69 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit vector(df)
So it's about 5,000 times faster.
Upvotes: 1