Reputation: 4537
I have an irregular dataset where some indices show up multiple times and now want to calculate a floating median, once based on a timespan before the epoch, and once based on a timespan after the epoch.
Pandas rolling
option would be good, however, it does not work well in case some indices occur multiple times...
I have written a function that does calculate my desired result. However, I suppose it is probably not the most efficient implementation and probably there is a more elegant solution using pandas directly.
Here is the code:
fun
is my implementation based on the previous 2 days,
fun2
is the same based on the next 2 days.
rolling
tries to do the same but gives a different result in case some indices show up more often than once.
import numpy as np
import pandas as pd
def dummy_data():
idx = np.array([pd.Timestamp(year=2021, month=1, day=1, hour=1),
pd.Timestamp(year=2021, month=1, day=2, hour=2),
pd.Timestamp(year=2021, month=1, day=3, hour=5),
pd.Timestamp(year=2021, month=1, day=3, hour=5),
pd.Timestamp(year=2021, month=1, day=3, hour=5),
pd.Timestamp(year=2021, month=1, day=8, hour=15),
pd.Timestamp(year=2021, month=1, day=9, hour=18),
pd.Timestamp(year=2021, month=1, day=10, hour=14),
])
data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
return pd.DataFrame(data, index=idx, columns=["l"])
def rolling_median_irregular(ds, left, right):
res = pd.Series(index=ds.index, dtype=np.float64)
for t in ds.index:
val = ds.loc[(ds.index >= t - left) & (ds.index <= t + right)].median()
res.loc[t] = val
return res
if __name__ == "__main__":
df = dummy_data()
df["fun"] = rolling_median_irregular(df["l"], left=pd.Timedelta(days=2), right=pd.Timedelta(days=0))
df["rolling"] = df["l"].rolling("2D").median()
df["fun2"] = rolling_median_irregular(df["l"], left=pd.Timedelta(days=0), right=pd.Timedelta(days=2))
df["rolling2"] = df.loc[::-1, 'l'].rolling("2D").median().loc[::-1]
print(df.head(10))
The problem with rolling
is that for the 2021-01-03
index, it will not work properly because it does not understand that this index shows up multiple times.
Compare the 3rd line. The correct median would be 3.5
because the timespan covers the values [2 3 4 5]
. The rolling
function lists 2.5
because it thinks the timespan only covers the values [2 3]
.
At the 5th row, the rolling
is correct because now it considers all 2021-01-03
data.
l fun rolling fun2 rolling2
2021-01-01 01:00:00 1 1.0 1.0 1.5 1.5
2021-01-02 02:00:00 2 1.5 1.5 3.5 3.5
2021-01-03 05:00:00 3 3.5 2.5 4.0 4.0
2021-01-03 05:00:00 4 3.5 3.0 4.0 4.5
2021-01-03 05:00:00 5 3.5 3.5 4.0 5.0
2021-01-08 15:00:00 6 6.0 6.0 7.0 7.0
2021-01-09 18:00:00 7 6.5 6.5 7.5 7.5
2021-01-10 14:00:00 8 7.0 7.0 8.0 8.0
I would be looking for a better (faster/easier/more flexible) implementation of my rolling_median_irregular
implementation.
It is not important if the outer edge (the one that is not the reference epoch) is included or not in the calculations because my data is irregular down to the millisecond thus there will almost never be the case that one epoch is directly at the edge.
Upvotes: 2
Views: 132
Reputation: 29635
You can use the rolling
function, but you'll need to remove the "wrong" rows by using duplicated
on the index, once using keep='last' and once 'first' depending on the direction of your rolling. Then thanks to index alignment, it populates the wanted value even for duplicates.
df['roll_test'] = df['l'].rolling("2D").median()[~df.index.duplicated(keep='last')]
df['roll_test2'] = (df.loc[::-1, 'l'].rolling("2D").median()
.loc[::-1][~df.index.duplicated(keep='first')])
print(df)
l fun rolling fun2 rolling2 roll_test roll_test2
2021-01-01 01:00:00 1 1.0 1.0 1.5 1.5 1.0 1.5
2021-01-02 02:00:00 2 1.5 1.5 3.5 3.5 1.5 3.5
2021-01-03 05:00:00 3 3.5 2.5 4.0 4.0 3.5 4.0
2021-01-03 05:00:00 4 3.5 3.0 4.0 4.5 3.5 4.0
2021-01-03 05:00:00 5 3.5 3.5 4.0 5.0 3.5 4.0
2021-01-08 15:00:00 6 6.0 6.0 7.0 7.0 6.0 7.0
2021-01-09 18:00:00 7 6.5 6.5 7.5 7.5 6.5 7.5
2021-01-10 14:00:00 8 7.0 7.0 8.0 8.0 7.0 8.0
Upvotes: 1