Reputation: 1792
I have some longitudinal/panel data that takes the form below (code for data entry is below the question). Observations of x and y are indexed by time and country (e.g. USA at time 1, USA at time 2, CAN at time 1).
country  time  x   y
USA         1  5  10
USA         2  5  12
USA         3  6  13
CAN         1  2   2
CAN         2  2   3
CAN         3  4   5
I'm trying to predict y using sklearn. For a reproducible example, we could use, say, linear regression.
In order to perform CV, I can't use train_test_split because the split might, for example, put data from time = 3 in X_train and data from time = 2 in y_test. This would be unhelpful because at time = 2, when we would be trying to predict y, we would not yet have the data from time = 3 to train on.
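To make the concern concrete, here is a minimal sketch that counts how often a random split trains on rows from later than the earliest test period. The numeric long-format frame and the helper name has_leakage are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy panel from the question in long format (numeric dtypes are an assumption).
df = pd.DataFrame({
    "country": ["USA", "USA", "USA", "CAN", "CAN", "CAN"],
    "time": [1, 2, 3, 1, 2, 3],
    "x": [5, 5, 6, 2, 2, 4],
    "y": [10, 12, 13, 2, 3, 5],
})

def has_leakage(train, test):
    """Hypothetical helper: True if any training row is later than the earliest test row."""
    return bool(train["time"].max() > test["time"].min())

# Try several seeds and count the random splits that use future data for training.
leaky = 0
for seed in range(20):
    train, test = train_test_split(df, test_size=2, random_state=seed)
    leaky += has_leakage(train, test)

print(f"{leaky} of 20 random splits train on data from after the test period starts")
```

The only split of this toy data that avoids the problem is the one that happens to hold out both time = 3 rows, which is why a time-aware splitter is needed.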
I'm trying to use TimeSeriesSplit to achieve CV as shown in this image:
(source: https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection)
from sklearn.model_selection import TimeSeriesSplit

y = df.y
X = df.drop(columns=['y'])  # keyword form; the positional axis argument is no longer accepted by recent pandas
print(y)
print(X)

X = X.to_numpy()
tscv = TimeSeriesSplit(n_splits=2, max_train_size=3)
print(tscv)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    # y is a Series indexed by country, so select rows by position
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
This gives something close to what I need, but not quite:
TRAIN: [0 1] TEST: [2 3]
TRAIN: [1 2 3] TEST: [4 5]
How do I use the TimeSeriesSplit indices to cross-validate a model?
I believe a complication may be that my data isn't strictly time-series: it's indexed not only by time but also by country, hence the longitudinal/panel nature of the data.
My desired output is:
eg
TRAIN: [1] TEST: [2]
TRAIN: [1 2] TEST: [3]
An X_train, X_test, y_train, and y_test that are split using the indices above, based on the value of time, or clarity as to whether I need to do that.
An accuracy score of any model (e.g. linear regression) cross-validated using the "walk forward" CV method.
Edit: thank you to @sabacherli for answering the first part of my question, and fixing the errors that were being thrown up.
import numpy as np
import pandas as pd
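# Note: a NumPy array mixing strings and ints stores every value as a string
data = np.array([['country', 'time', 'x', 'y'],
                 ['USA', 1, 5, 10],
                 ['USA', 2, 5, 12],
                 ['USA', 3, 6, 13],
                 ['CAN', 1, 2, 2],
                 ['CAN', 2, 2, 3],
                 ['CAN', 3, 4, 5]])
# First row holds the column names, first column holds the country index
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])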
df
Upvotes: 3
Views: 2470
Reputation: 19322
TimeSeriesSplit assumes that your dataset is indexed by time, meaning each row belongs to a different time step. So why not unstack the data so that time is the only index, and then split? After the split, you can stack the data again to get back the long table you need for training.
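import numpy as np
import pandas as pd

# Note: a NumPy array mixing strings and ints stores every value as a string
data = np.array([['country', 'time', 'x', 'y'],
                 ['USA', 1, 5, 10],
                 ['USA', 2, 5, 12],
                 ['USA', 3, 6, 13],
                 ['CAN', 1, 2, 2],
                 ['CAN', 2, 2, 3],
                 ['CAN', 3, 4, 5]])
df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])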
df1 = df.reset_index().set_index(['time','index']).unstack(-1)
print(df1)
        x       y
index CAN USA CAN USA
time
1       2   5   2  10
2       2   5   3  12
3       4   6   5  13
Now, since each row is indexed by time, you can easily split this data into folds and then, after the split, stack it again to get your X_train, X_test, etc.
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=2, max_train_size=3)
X_cols = ['time', 'index', 'x']
y_cols = ['y']
for train_index, test_index in tscv.split(df1):
    print("TRAIN:", train_index, "TEST:", test_index)
    # Stack back to long format: one row per (time, country) pair
    train = df1.iloc[train_index].stack(-1).reset_index()
    test = df1.iloc[test_index].stack(-1).reset_index()
    X_train, X_test = train[X_cols].to_numpy(), test[X_cols].to_numpy()
    y_train, y_test = train[y_cols].to_numpy(), test[y_cols].to_numpy()
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
You can print the last fold's X_train, X_test, y_train, and y_test to see what's happening:
print('For - TRAIN: [0 1] TEST: [2]')
print(" ")
print("X_train:")
print(X_train)
print(" ")
print("X_test:")
print(X_test)
print(" ")
print("y_train:")
print(y_train)
print(" ")
print("y_test:")
print(y_test)
For - TRAIN: [0 1] TEST: [2]
X_train:
[['1' 'CAN' '2']
 ['1' 'USA' '5']
 ['2' 'CAN' '2']
 ['2' 'USA' '5']]
X_test:
[['3' 'CAN' '4']
 ['3' 'USA' '6']]
y_train:
[['2']
 ['10']
 ['3']
 ['12']]
y_test:
[['5']
 ['13']]
So now you can split a dataframe by time and expand it back to the shape you need for training.
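For the second part of the question (a walk-forward cross-validated score), here is a minimal sketch rather than a definitive implementation. It makes several assumptions that are not in the original post: the data is rebuilt as a numeric long-format frame, the split is done on the unique time values instead of on an unstacked frame, x is the only feature, and mean absolute error stands in for "accuracy", since accuracy is not defined for a regression target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Same toy panel, built with numeric dtypes (assumption) so a model can be fitted.
df = pd.DataFrame({
    "country": ["USA", "USA", "USA", "CAN", "CAN", "CAN"],
    "time": [1, 2, 3, 1, 2, 3],
    "x": [5, 5, 6, 2, 2, 4],
    "y": [10, 12, 13, 2, 3, 5],
})

# Split the unique time steps, not the rows, so all countries at a
# given time always land in the same fold.
times = np.sort(df["time"].unique())
tscv = TimeSeriesSplit(n_splits=len(times) - 1)

fold_errors = []
for train_pos, test_pos in tscv.split(times):
    train = df[df["time"].isin(times[train_pos])]
    test = df[df["time"].isin(times[test_pos])]

    # Fit on the past only, then score on the next time step.
    model = LinearRegression().fit(train[["x"]], train["y"])
    pred = model.predict(test[["x"]])
    fold_errors.append(mean_absolute_error(test["y"], pred))

print("MAE per fold:", fold_errors)
print("Mean MAE:", np.mean(fold_errors))
```

Splitting on the unique times gives the same folds as the unstack/stack approach above, but it also works when the panel is unbalanced (countries missing at some time steps), where unstacking would introduce NaN values.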
Upvotes: 2