Cross Validation for longitudinal/panel data in scikit-learn

Question

I have some longitudinal/panel data that takes the form below (code for data entry is below the question). Observations of X and y are indexed by time and country (eg USA at time 1, USA at time 2, CAN at time 1).

     time  x   y
USA     1  5  10
USA     2  5  12
USA     3  6  13
CAN     1  2   2
CAN     2  2   3
CAN     3  4   5

I'm trying to predict y using sklearn. For a reproducible example, we could use, say, linear regression.

To perform CV, I can't use train_test_split, because a random split might, for example, put data from time = 3 in the training set and data from time = 2 in the test set. That would be unhelpful: when predicting y at time = 2, we would not yet have data from time = 3 to train on.
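The concern can be checked directly. The sketch below is only an illustration: the helper name leaks_future is hypothetical, and the time array simply mirrors the six rows of the example. It shuffles the rows with train_test_split and reports whether any training row is later than the earliest test row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Time index of the six example rows (USA 1-3, then CAN 1-3).
time = np.array([1, 2, 3, 1, 2, 3])


def leaks_future(train_times, test_times):
    """Return True if any training row is later than the earliest test row."""
    return bool(np.max(train_times) > np.min(test_times))


# A shuffled split ignores time, so the check will often come back True.
train_times, test_times = train_test_split(
    time, test_size=2, shuffle=True, random_state=0
)
print("train times:", train_times, "test times:", test_times)
print("leaks future data:", leaks_future(train_times, test_times))
```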

I'm trying to use TimeSeriesSplit in order to achieve CV as shown in this image:

(source: https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection)

from sklearn.model_selection import TimeSeriesSplit

y = df.y
# Keyword form: the positional axis argument (df.drop(['y'], 1)) fails on pandas >= 2.0
X = df.drop(columns=['y'])
print(y)
print(X)

X = X.to_numpy()

tscv = TimeSeriesSplit(n_splits=2, max_train_size=3)
print(tscv)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    # y is indexed by country label, so select rows by position
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

This gives something close to what I need, but not quite:

TRAIN: [0 1] TEST: [2 3]
TRAIN: [1 2 3] TEST: [4 5]

How can I now use TimeSeriesSplit indices to cross-validate a model?

I believe a complication may be that my data isn't strictly time-series: it's not only indexed by time, but also by country, hence the longitudinal/panel nature of the data.

My desired output is:

A series of test and train indices that allow me to perform "walk forward" CV

eg

TRAIN: [1] TEST: [2]
TRAIN: [1 2] TEST: [3]

An X_train, X_test, y_train, y_test that are split using the indices above, based on the value of time, or clarity as to whether I need to do that.
An accuracy score of any model (e.g. linear regression) cross-validated using the "walk forward" CV method.

Edit: thank you to @sabacherli for answering the first part of my question, and fixing the errors that were being thrown up.

Code for Data Entry

import numpy as np
import pandas as pd

data = np.array([['country','time','x','y'],
                ['USA',1, 5, 10],
                ['USA',2, 5, 12],
                ['USA',3,6, 13],
                ['CAN',1,2, 2],
                ['CAN',2,2, 3],
                ['CAN',3,4, 5]],                
               )
                
df = pd.DataFrame(data=data[1:,1:],
                  index=data[1:,0],
                  columns=data[0,1:])

df

Akshay Sehgal · Accepted Answer

TimeSeriesSplit assumes that your dataset is indexed by time, meaning each row belongs to a different time step. So why not unstack the data so that time is the only index, and then split? After the split, you can stack the data again to recover the long-format table for training.

import numpy as np
import pandas as pd

data = np.array([['country','time','x','y'],
                ['USA',1, 5, 10],
                ['USA',2, 5, 12],
                ['USA',3,6, 13],
                ['CAN',1,2, 2],
                ['CAN',2,2, 3],
                ['CAN',3,4, 5]],                
               )

df = pd.DataFrame(data=data[1:,1:],
                  index=data[1:,0],
                  columns=data[0,1:])

df1 = df.reset_index().set_index(['time','index']).unstack(-1)
print(df1)

        x       y    
index CAN USA CAN USA
time                 
1       2   5   2  10
2       2   5   3  12
3       4   6   5  13

Now, since each row is indexed by time, you can easily split this data into folds and, after the split, stack it again to get your X_train, X_test, etc.

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits = 2, max_train_size=3)

X_cols = ['time', 'index', 'x']
y_cols = ['y']

for train_index, test_index in tscv.split(df1):
    print("TRAIN:", train_index, "TEST:", test_index)
    # Stack each fold back to long format: one row per (time, country) pair
    train_df = df1.iloc[train_index].stack(-1).reset_index()
    test_df = df1.iloc[test_index].stack(-1).reset_index()
    X_train, X_test = train_df[X_cols].to_numpy(), test_df[X_cols].to_numpy()
    y_train, y_test = train_df[y_cols].to_numpy(), test_df[y_cols].to_numpy()

TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]

You can print the last fold's X_train, X_test, y_train and y_test to see what's happening:

print('For - TRAIN: [0 1] TEST: [2]')
print(" ")
print("X_train:")
print(X_train)
print(" ")
print("X_test:")
print(X_test)
print(" ")
print("y_train:")
print(y_train)
print(" ")
print("y_test:")
print(y_test)

For - TRAIN: [0 1] TEST: [2]

X_train:
[['1' 'CAN' '2']
 ['1' 'USA' '5']
 ['2' 'CAN' '2']
 ['2' 'USA' '5']]
 
X_test:
[['3' 'CAN' '4']
 ['3' 'USA' '6']]
 
y_train:
[['2']
 ['10']
 ['3']
 ['12']]
 
y_test:
[['5']
 ['13']]

So now you can split a dataframe by time and expand it back to the shape you need for training.
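To get a cross-validated score out of these folds, one option is to fit the model inside the loop and collect an error per fold. The sketch below is a minimal illustration under several assumptions, none of which come from the question: the toy panel is re-entered with numeric dtypes (the array-based entry stores every value as a string), only time and x are used as features, country is left out for brevity, and mean absolute error stands in for the "accuracy score" since this is a regression.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Same toy panel, entered with numeric dtypes (assumption: the array-based
# entry in the question stores every value as a string).
long_df = pd.DataFrame({
    "country": ["USA", "USA", "USA", "CAN", "CAN", "CAN"],
    "time": [1, 2, 3, 1, 2, 3],
    "x": [5, 5, 6, 2, 2, 4],
    "y": [10, 12, 13, 2, 3, 5],
})

# One row per time step, as in the unstacking step.
wide = long_df.set_index(["time", "country"]).unstack(-1)

feature_cols = ["time", "x"]  # country is omitted to keep the sketch short
tscv = TimeSeriesSplit(n_splits=2)

fold_errors = []
for train_index, test_index in tscv.split(wide):
    # Stack each fold back to long format before fitting.
    train_df = wide.iloc[train_index].stack(-1).reset_index()
    test_df = wide.iloc[test_index].stack(-1).reset_index()

    model = LinearRegression().fit(train_df[feature_cols], train_df["y"])
    preds = model.predict(test_df[feature_cols])
    fold_errors.append(mean_absolute_error(test_df["y"], preds))

print("MAE per fold:", fold_errors)
print("Mean MAE:", sum(fold_errors) / len(fold_errors))
```

Each fold trains only on time steps earlier than the ones it is tested on, which is the "walk forward" behaviour asked for. Any other scikit-learn regressor or metric could be swapped in at the same two lines.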
