Applying SMOTE-Tomek to nested cross validation with timeseries data

Question

I want to perform nested cross-validation for a classification problem while ensuring that the model is never exposed to future data. Since the data is a time series, I plan to use a time-aware splitting strategy. In the inner loop of the nested cross-validation, I want to apply SMOTE-Tomek, but I'm not sure how to do this.

This is my sample dataframe.

# Test data
data = pd.DataFrame({
    "Date": pd.date_range(start="2023-01-01", periods=100, freq='D'),
    "Feature1": np.random.rand(100),
    "Feature2": np.random.rand(100),
    'Category': [random.choice(['A', 'B', 'C']) for _ in range(100)],
    "Target": np.random.choice([0, 1], size=100)
})

And this is my code so far. I'm not sure if my approach is correct:

!pip install imbalanced-learn

import pandas as pd
import numpy as np
import random
from sklearn.model_selection import cross_validate, GridSearchCV, TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import category_encoders as ce
from imblearn.combine import SMOTETomek

# Defining X and y
encoder = ce.OneHotEncoder(handle_missing='value', handle_unknown='value', use_cat_names=True)

X0 = data[['Feature1', 'Feature2', 'Category']]
X_enc = encoder.fit_transform(X0)
y = data['Target']  # 1-D target; a column vector triggers a DataConversionWarning in scikit-learn

X = np.array(X_enc)
y = np.array(y)

# Start nested cross-validation
# Defining each Model
RFC = RandomForestClassifier(random_state=0)
LR = LogisticRegression(random_state=0)

# Defining the cross-validation function with TimeSeriesSplit
def CrossValidate(model, X, y, print_model=False):
    # Using TimeSeriesSplit for splitting by date
    tscv = TimeSeriesSplit(n_splits=10)  

    cv = cross_validate(model, X, y, scoring='f1_macro', cv=tscv)
    
    # Join scores and calculate the mean
    scores = ' + '.join(f'{s:.2f}' for s in cv["test_score"])
    mean_ = cv["test_score"].mean()
    
    # Message formatting for classification model output
    msg = f'Cross-validated F1 score: ({scores}) / 10 = {mean_:.2f}'
    
    if print_model:
        msg = f'{model}:\n\t{msg}\n'
    
    print(msg)

# Inner loops
# Logistic Regression inner loop
LR_grid = GridSearchCV(LogisticRegression(random_state=0), 
                       param_grid={'C': [10, 100]})
CrossValidate(LR_grid, X, y, print_model=False)
LR_grid.fit(X, y)
print('The best Parameters for Logistic Regression are:', LR_grid.best_params_)

# Random Forest inner loop
RFC_grid = GridSearchCV(RandomForestClassifier(random_state=0), 
                       param_grid={'n_estimators': [2, 3],
                                   'max_depth': [3, 5]})
CrossValidate(RFC_grid, X, y, print_model=False)
RFC_grid.fit(X, y)
print('The best Parameters for Random Forest are:', RFC_grid.best_params_)
