I want to perform nested cross-validation for a classification problem while ensuring that the model is never exposed to future data. Since the data is a time series, I plan to use a time-aware splitting strategy. In the inner loop of the nested cross-validation, I want to apply SMOTE-Tomek, but I'm not sure how to do this.
This is my sample dataframe.
# Test data
data = pd.DataFrame({
    "Date": pd.date_range(start="2023-01-01", periods=100, freq="D"),
    "Feature1": np.random.rand(100),
    "Feature2": np.random.rand(100),
    "Category": [random.choice(["A", "B", "C"]) for _ in range(100)],
    "Target": np.random.choice([0, 1], size=100),
})
And this is my code so far. I'm not sure whether my approach is correct:
!pip install imbalanced-learn
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import cross_validate, GridSearchCV, TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import category_encoders as ce
from imblearn.combine import SMOTETomek
# Defining X and y
encoder = ce.OneHotEncoder(handle_missing='value', handle_unknown='value', use_cat_names=True)
X0 = data[['Feature1', 'Feature2', 'Category']]
X_enc = encoder.fit_transform(X0)
y = data['Target']  # 1-D target; an (n, 1) column makes scikit-learn emit a DataConversionWarning
X = np.array(X_enc)
y = np.array(y)
# Start nested cross-validation
# Defining each Model
RFC = RandomForestClassifier(random_state=0)
LR = LogisticRegression(random_state=0)
# Defining the cross-validation function with TimeSeriesSplit
def CrossValidate(model, X, y, print_model=False):
    # Using TimeSeriesSplit for splitting by date
    tscv = TimeSeriesSplit(n_splits=10)
    cv = cross_validate(model, X, y, scoring='f1_macro', cv=tscv)
    # Join scores and calculate the mean
    scores = ' + '.join(f'{s:.2f}' for s in cv["test_score"])
    mean_ = cv["test_score"].mean()
    # Message formatting for classification model output
    msg = f'Cross-validated F1 score: ({scores}) / 10 = {mean_:.2f}'
    if print_model:
        msg = f'{model}:\n\t{msg}\n'
    print(msg)
# Inner loops
# Logistic Regression inner loop
LR_grid = GridSearchCV(LogisticRegression(random_state=0),
                       param_grid={'C': [10, 100]})
CrossValidate(LR_grid, X, y, print_model=False)
LR_grid.fit(X, y)
print('The best Parameters for Logistic Regression are:', LR_grid.best_params_)
# Random Forest inner loop
RFC_grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={'n_estimators': [2, 3],
                                    'max_depth': [3, 5]})
CrossValidate(RFC_grid, X, y, print_model=False)
RFC_grid.fit(X, y)
print('The best Parameters for Random Forest are:', RFC_grid.best_params_)