Tom

Reputation: 1063

Train_test_split produces unexpected sample size due to rounding

I'm building a logistic regression model using sklearn. I'm essentially trying to simulate running the model each day to make a decision for the following day.

To get train_test_split to take the set of choices available on the "test" day, I just set the split to:

splt = 1/observ_days

The size of the observation set is observed days * available choices. The number of available choices varies slightly with availability, but in the problem case it is 45. So the total number of observations is 45 * observ_days, and the test set should contain 45 rows.

The challenge is that train_test_split always rounds the test size up, so a common floating-point precision issue can produce an unexpected split. Specifically, in my case the number of observed days is 744, and 1/744 = 0.0013440860215053765. The dataset at that moment has 33480 rows. In exact arithmetic, 33480 * the split = 45, as it should. But Python comes up with 45.00000000000001, so train_test_split gives me 46 test observations.
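The behaviour is reproducible without sklearn; scikit-learn computes the test-set size for a float test_size by ceiling the product (a sketch of the arithmetic, not sklearn's exact internals):

```python
import math

observ_days = 744
choices_per_day = 45
n_samples = observ_days * choices_per_day  # 33480

split = 1 / observ_days        # 0.0013440860215053765, not exactly 1/744
n_test = split * n_samples     # exactly 45 in real arithmetic...

# ...but a hair above 45 in IEEE doubles, so ceiling it gives 46.
print(n_test)                  # 45.00000000000001
print(math.ceil(n_test))       # 46
```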

That's a problem, because in my case the 46th observation is actually from another day. Is there a way to force train_test_split to round down? Or to specify the exact size of the train/test sets manually?

Upvotes: 0

Views: 1167

Answers (1)

dm2

Reputation: 4275

If you check the scikit-learn documentation for train_test_split, you'll notice you can specify train_size or test_size either as a float (a proportion of the whole dataset to be used) or as an int (an exact number of datapoints to be included).

In your case you could just specify test_size=45 to always take exactly 45 datapoints for the test set.
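A minimal sketch with dummy data standing in for the real features and labels (assumption: rows are ordered by day, oldest first, which is why shuffle=False is also passed, so the test set is exactly the last day's rows):

```python
from sklearn.model_selection import train_test_split

observ_days = 744
choices_per_day = 45

# Placeholder data; in the real problem X/y are the daily observations.
X = [[i] for i in range(observ_days * choices_per_day)]
y = [i % 2 for i in range(observ_days * choices_per_day)]

# An int test_size takes exactly that many rows -- no float rounding.
# shuffle=False preserves day order, so the test set is the final day.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=choices_per_day, shuffle=False
)
print(len(X_test))   # 45
```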

Upvotes: 1

Related Questions