Reputation: 1021
I'm trying to calculate the 10 day momentum of a statistic for a sports player that takes into account the date of the match each statistic was recorded for.
I'm not from a stats background but so far I'm cobbled together a theory that I could perform a linear regression over a rolling 10 day window where X
= days elapsed from the most recent date
in the window and y
= the statistic
. As each statistic is recorded for a given match then the window needs to exclude the statistic for the row that the output of the regression will be recorded in. The rolling regression must also be performed by player
.
Here's a sample dataset - I'm looking to calculate the momentum
column:
+----+--------+------------+------+----------+
| id | player | match_date | stat | momentum |
+----+--------+------------+------+----------+
| 1 | 1 | 2022-01-01 | 1500 | NaN |
| 2 | 1 | 2022-01-03 | 1600 | NaN |
| 3 | 1 | 2022-01-10 | 1500 | -50 |
| 4 | 1 | 2022-01-11 | 1800 | 3.73 |
| 5 | 1 | 2022-01-18 | 2100 | -300 |
| 6 | 2 | 2022-01-01 | 2100 | NaN |
| 7 | 2 | 2022-01-03 | 1800 | NaN |
| 8 | 2 | 2022-01-10 | 1600 | 150 |
| 9 | 2 | 2022-01-11 | 1500 | 48.51 |
| 10 | 2 | 2022-01-18 | 1500 | 100 |
+----+--------+------------+------+----------+
To illustrate how the momentum
column is calculated I've added in the manual calculation of the days elapsed in the tables below:
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
| id | player | match_date | stat | id1_days_elapsed | id2_days_elapsed | id3_days_elapsed | id4_days_elapsed | id5_days_elapsed | momentum |
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
| 1 | 1 | 2022-01-01 | 1500 | NA | 2 | 9 | 10 | | NaN |
| 2 | 1 | 2022-01-03 | 1600 | | | 7 | 8 | | NaN |
| 3 | 1 | 2022-01-10 | 1500 | | | | 1 | 8 | -50 |
| 4 | 1 | 2022-01-11 | 1800 | | | | | 7 | 3.73 |
| 5 | 1 | 2022-01-18 | 2100 | | | | | | -300 |
+----+--------+------------+------+------------------+------------------+------------------+------------------+------------------+----------+
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
| id | player | match_date | stat | id6_days_elapsed | id7_days_elapsed | id8_days_elapsed | id9_days_elapsed | id10_days_elapsed | momentum |
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
| 6 | 2 | 2022-01-01 | 2100 | NA | 2 | 9 | 10 | | NaN |
| 7 | 2 | 2022-01-03 | 1800 | | | 7 | 8 | | NaN |
| 8 | 2 | 2022-01-10 | 1600 | | | | 1 | 8 | 150 |
| 9 | 2 | 2022-01-11 | 1500 | | | | | 7 | 48.51 |
| 10 | 2 | 2022-01-18 | 1500 | | | | | | 100 |
+----+--------+------------+------+------------------+------------------+------------------+------------------+-------------------+----------+
Most answers I've read here point to statsmodels.regression.rolling.RollingOLS
but reading through the documentation this can't cope with a ragged time series which I have. I've read one answer that uses resampling when calculating a rolling mean but I'm not sure of what implications there would be for a regression.
Below is my pseudo attempt using groupby.rolling.apply
:
import numpy as np
import datetime as dt
import pandas as pd
from sklearn.linear_model import LinearRegression
dates = [
dt.datetime(2022, 1, 1),
dt.datetime(2022, 1, 3),
dt.datetime(2022, 1, 10),
dt.datetime(2022, 1, 11),
dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
{
"id": range(1, 11),
"player": [1 for i in range(5)] + [2 for i in range(5)],
"match_date": dates + dates,
"stat": (1500, 1600, 1500, 1800, 2100, 2100, 1800, 1600, 1500, 1500)
}
)
def calc_regression(df: pd.DataFrame, col: str):
df = df[~df[col].isnull()]
df["days_diff"] = df["match_date"].diff() / pd.to_timedelta(1, 'd')
days_elapsed = df.loc[::-1, "days_diff"].cumsum()[::-1].values.reshape((-1, 1))
model = LinearRegression()
return model.fit(days_elapsed, df[col].values).coef_[0]
df["momentum"] = (
df
.groupby("player")
.rolling("10d", on="match_date", closed="left")
.apply(calc_regression, args=("stat", ))
)
I'm aware that this doesn't work because rolling.apply
applies the calc_regression
on a column by column basis. However, I thought it might be useful to explain what I'm trying to do.
In the real DataFrame there are 3m rows and 80k players so a for loop will take an age - is there an alternative?
Upvotes: 1
Views: 713
Reputation: 1021
I managed to adapt a useful Q&A here on how to work with groupby
, rolling
and apply
all together across combinations of single and multiple input and output columns. On my journey I also discovered a post on np.polyfit
for the regression which was useful. My final code:
def resample_and_roll(df: pd.DataFrame):
df.reset_index(inplace=True)
df.set_index("match_date", drop=False, inplace=True)
df["roll_regression"] = (
df["stat"]
.rolling("11d", min_periods=3)
.apply(calc_regression, args=(df, ), raw=False)
)
df.set_index("index", inplace=True)
return df["roll_regression"]
def calc_regression(s: pd.Series, df: pd.DataFrame):
y_values = s[~np.isnan(s)][:-1].values
match_date = df.loc[s.index, "match_date"]
days_diff = match_date.diff().dt.days.shift(-1)
days_elapsed = days_diff.loc[::-1].cumsum()[::-1].values[:-1]
return np.polyfit(days_elapsed, y_values, deg=1)[0]
df["momentum"] = df.groupby("player").apply(resample_and_roll).droplevel(0)
Which outputs:
id player match_date stat momentum
0 1 1 2022-01-01 1500 NaN
1 2 1 2022-01-03 1600 NaN
2 3 1 2022-01-10 1500 -50.000000
3 4 1 2022-01-11 1800 3.731343
4 5 1 2022-01-18 2100 -300.000000
5 6 2 2022-01-01 2100 NaN
6 7 2 2022-01-03 1800 NaN
7 8 2 2022-01-10 1600 150.000000
8 9 2 2022-01-11 1500 48.507463
9 10 2 2022-01-18 1500 100.000000
Upvotes: 0
Reputation: 14664
Sice I don't have your data what I will give you is a general function that will
group by a given column
find the parameters to approximate the column given by y
parameter as a linear function of the column given by the x
parameter, based on the formulas from wikipedia's Simple linear regression page
def grouped_rolling_linear_regression(df, window=10, groups='l', x='x', y='y'):
'''
Computes linear regression parameters for a rolling window
withing grouped data where x
- df: the input data frame
- window: the wolling window
- groups: the column
Returns a dataframe with columns a, b, so that
a*x+b ~ y
'''
y = df[y]
x = df[x]
terms = pd.DataFrame({groups: df[groups], 'y': y, 'x': x, 'xx': x*x, 'xy': x*y})
terms[1] = 1;
s = terms.groupby(groups).rolling(window).sum()
# compute sum((x-mean(x))*(y-mean(y)))
Rxy = s['xy'] - s['x']*s['y'] / s[1]
# compute sum((x-mean(x))**2)
Rxx = s['xx'] - s['x']**2 / s[1]
a = Rxy / Rxx
b = (s['y'] - a * s['x']) / s[1]
return pd.DataFrame({'a': a, 'b': b}).dropna();
You can check that it works with a small example like this
s = pd.DataFrame([{'l':1, 'x': i, 'y': i} for i in range(100)] +
[{'l':2, 'x': i, 'y': -3.14*i + 482} for i in range(50)])
grouped_rolling_linear_regression(s)
This data has two groups, each with one equation, one has 100 samples, the other has 50 samples.
Since you mentioned that your concern was about taking ages to compute everything having 3M samples in 80k groups, I ran one test with random data. The generated data has 3.2M samples in 80k groups.
nplayers=80000
a,b = np.random.randn(2, 1, 40)
x = np.random.randn(80000, 40)
l = np.arange(len(x)).reshape(-1,1).repeat(40, axis=1)
y = a*x + b
s = pd.DataFrame()
for k,v in [('x',x), ('y',y), ('l',l)]:
s[k] = v.reshape(-1)
Using the default window size of 10, this will compute 2.4M valid linear regressions, here it finished in 19 seconds.
u = grouped_rolling_linear_regression(s)
The only thing you have to do in order to run the regression with your data is to create a different column with a numeric type, being the difference from a date to any initial date. For instance, the earliest date in your data frame.
import pandas as pd
import datetime as dt
dates = [
dt.datetime(2022, 1, 1),
dt.datetime(2022, 1, 3),
dt.datetime(2022, 1, 10),
dt.datetime(2022, 1, 11),
dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
{
"id": range(1, 11),
"player": [1 for i in range(5)] + [2 for i in range(5)],
"match_date": dates + dates,
"stat": (1500, 1600, 1500, 1800, 2100, 2100, 1800, 1600, 1500, 1500)
}
)
df['numeric-days'] = (df["match_date"] - min(df['match_date'])) / pd.to_timedelta(1, 'd')
grouped_rolling_linear_regression(df, x='numeric-days', y='stat', groups='player', window=4)
Given that the data has two groups with 5 entries each, and I choose window 4, it will fit two different lines to each category.
Upvotes: 1