Regressor

Reputation: 1973

How to create custom k-fold cross validation datasets for training models

I have a dataset at daily granularity covering four years: 2018, 2019, 2020 and 2021. There is also some data available for Q1 2022, which I will use as unseen data for model testing. I want to use k-fold cross-validation to create one dataset per year, so that I can loop through the folds, train a model on each and generate error metrics.

Here is what I am trying to do -
Training data - 2018-01-01 to 2021-12-31
Unseen data - 2022-01-01 to 2022-03-31

From the training data, I want to generate the folds as below -

iteration 1 -
training data - 2018-01-01 to 2018-12-31, validation data - 2019-01-01 to 2019-03-31
iteration 2 -
training data - 2019-01-01 to 2019-12-31, validation data - 2020-01-01 to 2020-03-31
iteration 3 - 
training data - 2020-01-01 to 2020-12-31, validation data - 2021-01-01 to 2021-03-31

Once I have these sets, I can use the training data for training and the validation data for evaluation. How can I do this in pandas?

Here is the sample data (other fields are hidden for confidentiality reasons): [image: sample data]

Upvotes: 0

Views: 410

Answers (1)

eschibli

Reputation: 872

Scikit-learn's TimeSeriesSplit lets you generate contiguous train and test folds of a defined size: TimeSeriesSplit(max_train_size=365, test_size=91) produces train folds of at most one year and test folds of approximately one quarter. Note that the folds will drift away from calendar years by about 1.25 days per year (4 × 91 = 364 days, against an average year of 365.25 days).

This should work for you if, as you suggest, it isn't critical to only test on Q1 of each year. If you prefer to test only on Q1, you should be able to do so with a list comprehension and pandas' datetime indexing, like:
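To make the TimeSeriesSplit option concrete, here is a minimal sketch. The synthetic daily frame, the "value" column and the choice of n_splits=3 are assumptions for illustration only, not part of the original data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Synthetic daily data standing in for the real dataset ("value" is a hypothetical column).
dates = pd.date_range("2018-01-01", "2021-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(dates))}, index=dates)

# n_splits=3 is an arbitrary choice for this sketch.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=365, test_size=91)

folds = []
for train_idx, test_idx in tscv.split(df):
    # split() yields positional indices, so use iloc to pick the rows.
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    folds.append((train, test))
    print(
        f"train {train.index[0].date()} to {train.index[-1].date()} | "
        f"test {test.index[0].date()} to {test.index[-1].date()}"
    )
```

Note that TimeSeriesSplit places the test folds back-to-back at the end of the series, so successive folds step forward by one quarter rather than one year. If the validation windows must be Q1 specifically, explicit date slicing is the better fit.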

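import numpy as np
import pandas as pd

years = np.arange(2018, 2021)  # 2018, 2019, 2020

# set drop=False if you wish to retain the old index as a column
df = df.set_index("created_date", drop=True)

df.index = pd.to_datetime(df.index)  # if it isn't already
df = df.sort_index()  # date-based label slicing expects a sorted index

# use .loc for partial-string date selection; plain df["2018"] looks up a column in recent pandas
cv_splits = [(df.loc[f"{year}"], df.loc[f"{year + 1}-01":f"{year + 1}-03"]) for year in years]

# returns a list of (train_df, test_df) tuples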

This should give you a list of tuples, each containing first all the samples from a single year, then all the samples from the first quarter of the following year.
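As a rough sketch of how such a list of (train, validation) tuples could be looped over to collect an error metric per fold: the synthetic data, the "target" column and the mean-prediction placeholder model below are all assumptions for illustration. Swap in your own estimator and metric.

```python
import numpy as np
import pandas as pd

# Synthetic daily data standing in for the real dataset ("target" is a hypothetical column).
dates = pd.date_range("2018-01-01", "2021-12-31", freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame({"target": rng.normal(100.0, 10.0, len(dates))}, index=dates)

# One full calendar year for training, Q1 of the following year for validation.
years = range(2018, 2021)
cv_splits = [
    (df.loc[f"{year}"], df.loc[f"{year + 1}-01":f"{year + 1}-03"])
    for year in years
]

results = []
for train_df, val_df in cv_splits:
    # Placeholder "model": predict the training mean for every validation day.
    prediction = train_df["target"].mean()
    mae = (val_df["target"] - prediction).abs().mean()
    results.append(
        {
            "train_start": train_df.index.min(),
            "train_end": train_df.index.max(),
            "val_start": val_df.index.min(),
            "val_end": val_df.index.max(),
            "mae": mae,
        }
    )

# One row per fold, with its date ranges and mean absolute error.
metrics = pd.DataFrame(results)
print(metrics)
```

Keeping the fold boundaries in the results table makes it easy to check that each validation window really is Q1 of the year after its training year.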

Upvotes: 1
