Rolling regression with ragged time series

I'm trying to calculate the 10 day momentum of a statistic for a sports player that takes into account the date of the match each statistic was recorded for.

I'm not from a stats background but so far I'm cobbled together a theory that I could perform a linear regression over a rolling 10 day window where X = days elapsed from the most recent date in the window and y = the statistic. As each statistic is recorded for a given match then the window needs to exclude the statistic for the row that the output of the regression will be recorded in. The rolling regression must also be performed by player.

Here's a sample dataset - I'm looking to calculate the momentum column:

+----+--------+------------+------+----------+
| id | player | match_date | stat | momentum |
+----+--------+------------+------+----------+
|  1 |      1 | 2022-01-01 | 1500 | NaN      |
|  2 |      1 | 2022-01-03 | 1600 | NaN      |
|  3 |      1 | 2022-01-10 | 1500 | -50      |
|  4 |      1 | 2022-01-11 | 1800 | 3.73     |
|  5 |      1 | 2022-01-18 | 2100 | -300     |
|  6 |      2 | 2022-01-01 | 2100 | NaN      |
|  7 |      2 | 2022-01-03 | 1800 | NaN      |
|  8 |      2 | 2022-01-10 | 1600 | 150      |
|  9 |      2 | 2022-01-11 | 1500 | 48.51    |
| 10 |      2 | 2022-01-18 | 1500 | 100      |
+----+--------+------------+------+----------+

To illustrate how the momentum column is calculated I've added in the manual calculation of the days elapsed in the tables below:

+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
| id | player | match_date | stat | id1_days_elapsed | id2_days_elapsed | id3_days_elapsed | id4_days_elapsed | id5_days_elapsed | momentum |
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
|  1 |      1 | 2022-01-01 | 1500 | NA               |                2 |                9 |               10 |                  | NaN      |
|  2 |      1 | 2022-01-03 | 1600 |                  |                  |                7 |                8 |                  | NaN      |
|  3 |      1 | 2022-01-10 | 1500 |                  |                  |                  |                1 |                8 | -50      |
|  4 |      1 | 2022-01-11 | 1800 |                  |                  |                  |                  |                7 | 3.73     |
|  5 |      1 | 2022-01-18 | 2100 |                  |                  |                  |                  |                  | -300     |
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+

+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
| id | player | match_date | stat | id6_days_elapsed | id7_days_elapsed | id8_days_elapsed | id9_days_elapsed | id10_days_elapsed | momentum |
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
|  6 |      2 | 2022-01-01 | 2100 | NA               |                2 |                9 |               10 |                   | NaN      |
|  7 |      2 | 2022-01-03 | 1800 |                  |                  |                7 |                8 |                   | NaN      |
|  8 |      2 | 2022-01-10 | 1600 |                  |                  |                  |                1 |                 8 | 150      |
|  9 |      2 | 2022-01-11 | 1500 |                  |                  |                  |                  |                 7 | 48.51    |
| 10 |      2 | 2022-01-18 | 1500 |                  |                  |                  |                  |                   | 100      |
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+

Most answers I've read here point to statsmodels.regression.rolling.RollingOLS but reading through the documentation this can't cope with a ragged time series which I have. I've read one answer that uses resampling when calculating a rolling mean but I'm not sure of what implications there would be for a regression.

Below is my pseudo attempt using groupby.rolling.apply:

import numpy as np
import datetime as dt
import pandas as pd
from sklearn.linear_model import LinearRegression

dates = [
    dt.datetime(2022, 1, 1),
    dt.datetime(2022, 1, 3),
    dt.datetime(2022, 1, 10),
    dt.datetime(2022, 1, 11),
    dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
    {
        "id": range(1, 11),
        "player": [1 for i in range(5)] + [2 for i in range(5)],
        "match_date": dates + dates,
        "stat": (1500, 1600, 1500, 1800, 2100, 2100, 1800, 1600, 1500, 1500)
    }
)


def calc_regression(df: pd.DataFrame, col: str):
    df = df[~df[col].isnull()]
    df["days_diff"] = df["match_date"].diff() / pd.to_timedelta(1, 'd')
    days_elapsed = df.loc[::-1, "days_diff"].cumsum()[::-1].values.reshape((-1, 1))
    model = LinearRegression()
    return model.fit(days_elapsed, df[col].values).coef_[0]


df["momentum"] = (
    df
    .groupby("player")
    .rolling("10d", on="match_date", closed="left")
    .apply(calc_regression, args=("stat", ))
)

I'm aware that this doesn't work because rolling.apply applies the calc_regression on a column by column basis. However, I thought it might be useful to explain what I'm trying to do.

In the real DataFrame there are 3m rows and 80k players so a for loop will take an age - is there an alternative?

Upvotes: 1

Answers (2)

Jossy

Reputation: 1021

I managed to adapt a useful Q&A here on how to work with groupby, rolling and apply all together across combinations of single and multiple input and output columns. On my journey I also discovered a post on np.polyfit for the regression which was useful. My final code:

def resample_and_roll(df: pd.DataFrame):
    df.reset_index(inplace=True)
    df.set_index("match_date", drop=False, inplace=True)
    df["roll_regression"] = (
        df["stat"]
        .rolling("11d", min_periods=3)
        .apply(calc_regression, args=(df, ), raw=False)
    )
    df.set_index("index", inplace=True)
    return df["roll_regression"]


def calc_regression(s: pd.Series, df: pd.DataFrame):
    y_values = s[~np.isnan(s)][:-1].values
    match_date = df.loc[s.index, "match_date"]
    days_diff = match_date.diff().dt.days.shift(-1)
    days_elapsed = days_diff.loc[::-1].cumsum()[::-1].values[:-1]
    return np.polyfit(days_elapsed, y_values, deg=1)[0]


df["momentum"] = df.groupby("player").apply(resample_and_roll).droplevel(0)

Which outputs:

   id  player match_date  stat    momentum
0   1       1 2022-01-01  1500         NaN
1   2       1 2022-01-03  1600         NaN
2   3       1 2022-01-10  1500  -50.000000
3   4       1 2022-01-11  1800    3.731343
4   5       1 2022-01-18  2100 -300.000000
5   6       2 2022-01-01  2100         NaN
6   7       2 2022-01-03  1800         NaN
7   8       2 2022-01-10  1600  150.000000
8   9       2 2022-01-11  1500   48.507463
9  10       2 2022-01-18  1500  100.000000

Upvotes: 0

Bob

Reputation: 14664

Sice I don't have your data what I will give you is a general function that will

group by a given column find the parameters to approximate the column given by y parameter as a linear function of the column given by the x parameter, based on the formulas from wikipedia's Simple linear regression page

def grouped_rolling_linear_regression(df, window=10, groups='l', x='x', y='y'):
    '''
    Computes linear regression parameters for a rolling window
    withing grouped data where x
     - df: the input data frame
     - window: the wolling window
     - groups: the column 
    
    Returns a dataframe with columns a, b, so that
      a*x+b ~ y
    '''
    y = df[y]
    x = df[x]
    terms = pd.DataFrame({groups: df[groups], 'y': y, 'x': x, 'xx': x*x, 'xy': x*y})
    terms[1] = 1;
    s = terms.groupby(groups).rolling(window).sum()
    # compute sum((x-mean(x))*(y-mean(y)))
    Rxy = s['xy'] - s['x']*s['y'] / s[1]
    # compute sum((x-mean(x))**2)
    Rxx = s['xx'] - s['x']**2 / s[1]
    a = Rxy / Rxx
    b = (s['y'] - a * s['x']) / s[1]
    return pd.DataFrame({'a': a, 'b': b}).dropna();

Validation

You can check that it works with a small example like this

s = pd.DataFrame([{'l':1, 'x': i, 'y': i} for i in range(100)] + 
                 [{'l':2, 'x': i, 'y': -3.14*i + 482} for i in range(50)])

grouped_rolling_linear_regression(s)

This data has two groups, each with one equation, one has 100 samples, the other has 50 samples.

Performance

Since you mentioned that your concern was about taking ages to compute everything having 3M samples in 80k groups, I ran one test with random data. The generated data has 3.2M samples in 80k groups.

nplayers=80000
a,b = np.random.randn(2, 1, 40)
x = np.random.randn(80000, 40)
l = np.arange(len(x)).reshape(-1,1).repeat(40, axis=1)
y = a*x + b
s = pd.DataFrame()
for k,v in [('x',x), ('y',y), ('l',l)]:
    s[k] = v.reshape(-1)

Using the default window size of 10, this will compute 2.4M valid linear regressions, here it finished in 19 seconds.

u = grouped_rolling_linear_regression(s)

Applied to your specific problem

The only thing you have to do in order to run the regression with your data is to create a different column with a numeric type, being the difference from a date to any initial date. For instance, the earliest date in your data frame.

import pandas as pd
import datetime as dt
dates = [
    dt.datetime(2022, 1, 1),
    dt.datetime(2022, 1, 3),
    dt.datetime(2022, 1, 10),
    dt.datetime(2022, 1, 11),
    dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
    {
        "id": range(1, 11),
        "player": [1 for i in range(5)] + [2 for i in range(5)],
        "match_date": dates + dates,
        "stat": (1500, 1600, 1500, 1800, 2100, 2100, 1800, 1600, 1500, 1500)
    }
)

df['numeric-days'] = (df["match_date"] - min(df['match_date'])) / pd.to_timedelta(1, 'd')
grouped_rolling_linear_regression(df, x='numeric-days', y='stat', groups='player', window=4)

Given that the data has two groups with 5 entries each, and I choose window 4, it will fit two different lines to each category.