user7431005
user7431005

Reputation: 4537

efficient way to calculate rolling median using pandas on irregular dataset with multiple occurences of same index

I have an irregular dataset where some indices show up multiple times and now want to calculate a floating median, once based on a timespan before the epoch, and once based on a timespan after the epoch.

Pandas rolling option would be good, however, it does not work well in case some indices occur multiple times...

I have written a function that does calculate my desired result. However, I suppose it is probably not the most efficient implementation and probably there is a more elegant solution using pandas directly.

Here is the code: fun is my implementation based on the previous 2 days, fun2 is the same based on the next 2 days. rolling tries to do the same but gives a different result in case some indices show up more often than once.

import numpy as np
import pandas as pd

def dummy_data():
    idx = np.array([pd.Timestamp(year=2021, month=1, day=1, hour=1),
                    pd.Timestamp(year=2021, month=1, day=2, hour=2),
                    pd.Timestamp(year=2021, month=1, day=3, hour=5),
                    pd.Timestamp(year=2021, month=1, day=3, hour=5),
                    pd.Timestamp(year=2021, month=1, day=3, hour=5),
                    pd.Timestamp(year=2021, month=1, day=8, hour=15),
                    pd.Timestamp(year=2021, month=1, day=9, hour=18),
                    pd.Timestamp(year=2021, month=1, day=10, hour=14),
                    ])
    data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
    return pd.DataFrame(data, index=idx, columns=["l"])

def rolling_median_irregular(ds, left, right):
    res = pd.Series(index=ds.index, dtype=np.float64)
    for t in ds.index:
        val = ds.loc[(ds.index >= t - left) & (ds.index <= t + right)].median()
        res.loc[t] = val
    return res

if __name__ == "__main__":
    df = dummy_data()
    df["fun"] = rolling_median_irregular(df["l"], left=pd.Timedelta(days=2), right=pd.Timedelta(days=0))
    df["rolling"] = df["l"].rolling("2D").median()
    df["fun2"] = rolling_median_irregular(df["l"], left=pd.Timedelta(days=0), right=pd.Timedelta(days=2))
    df["rolling2"] = df.loc[::-1, 'l'].rolling("2D").median().loc[::-1]

    print(df.head(10))

The problem with rolling is that for the 2021-01-03 index, it will not work properly because it does not understand that this index shows up multiple times. Compare the 3rd line. The correct median would be 3.5 because the timespan covers the values [2 3 4 5]. The rolling function lists 2.5 because it thinks the timespan only covers the values [2 3]. At the 5th row, the rolling is correct because now it considers all 2021-01-03 data.

                     l  fun  rolling  fun2  rolling2
2021-01-01 01:00:00  1  1.0      1.0   1.5       1.5
2021-01-02 02:00:00  2  1.5      1.5   3.5       3.5
2021-01-03 05:00:00  3  3.5      2.5   4.0       4.0
2021-01-03 05:00:00  4  3.5      3.0   4.0       4.5
2021-01-03 05:00:00  5  3.5      3.5   4.0       5.0
2021-01-08 15:00:00  6  6.0      6.0   7.0       7.0
2021-01-09 18:00:00  7  6.5      6.5   7.5       7.5
2021-01-10 14:00:00  8  7.0      7.0   8.0       8.0

I would be looking for a better (faster/easier/more flexible) implementation of my rolling_median_irregular implementation.

It is not important if the outer edge (the one that is not the reference epoch) is included or not in the calculations because my data is irregular down to the millisecond thus there will almost never be the case that one epoch is directly at the edge.

Upvotes: 2

Views: 132

Answers (1)

Ben.T
Ben.T

Reputation: 29635

You can use the rolling function, but you'll need to remove the "wrong" rows by using duplicated on the index, once using keep='last' and once 'first' depending on the direction of your rolling. Then thanks to index alignment, it populates the wanted value even for duplicates.

df['roll_test'] = df['l'].rolling("2D").median()[~df.index.duplicated(keep='last')]
df['roll_test2'] = (df.loc[::-1, 'l'].rolling("2D").median()
                      .loc[::-1][~df.index.duplicated(keep='first')])
print(df)
                     l  fun  rolling  fun2  rolling2  roll_test  roll_test2
2021-01-01 01:00:00  1  1.0      1.0   1.5       1.5        1.0         1.5
2021-01-02 02:00:00  2  1.5      1.5   3.5       3.5        1.5         3.5
2021-01-03 05:00:00  3  3.5      2.5   4.0       4.0        3.5         4.0
2021-01-03 05:00:00  4  3.5      3.0   4.0       4.5        3.5         4.0
2021-01-03 05:00:00  5  3.5      3.5   4.0       5.0        3.5         4.0
2021-01-08 15:00:00  6  6.0      6.0   7.0       7.0        6.0         7.0
2021-01-09 18:00:00  7  6.5      6.5   7.5       7.5        6.5         7.5
2021-01-10 14:00:00  8  7.0      7.0   8.0       8.0        7.0         8.0

Upvotes: 1

Related Questions