Reputation: 1063
I'm building a machine learning model based on logistic regression using sklearn. I'm essentially trying to simulate running the model each day to make a decision for the following day.
To get train_test_split to take the set of choices available on the "test" day, I just set the split to:
splt = 1/observ_days
The size of the observation set is the number of observed days * the number of available choices. The number of choices changes slightly with availability, but in the problem case it is 45, so the total number of observations is 45 * observ_days and the test set should contain exactly 45 rows.
The challenge is that train_test_split always rounds the test size up, so a common floating-point issue can produce an unexpected split. Specifically, in my case the number of observed days is 744, and 1/744 = 0.0013440860215053765. The dataset at that point contains 33480 rows. In exact arithmetic, 33480 * the split = 45, like it should. But Python comes up with 45.00000000000001, so train_test_split gives me 46 test observations.
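A minimal reproduction of the rounding, using the counts from the question (the ceiling step mirrors what train_test_split does internally with a float test_size):

```python
import math

observ_days = 744
total_obs = 45 * observ_days     # 33480 rows in the dataset
splt = 1 / observ_days           # 0.0013440860215053765

raw_test_size = total_obs * splt
print(raw_test_size)             # 45.00000000000001 rather than exactly 45
print(math.ceil(raw_test_size))  # 46 -- train_test_split ceils the float result
```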
That's a problem because in my case the 46th observation is actually from another day. Is there a way to force train_test_split to round down? Or to set the exact size of the train/test sets manually?
Upvotes: 0
Views: 1167
Reputation: 4275
If you check the scikit-learn documentation for train_test_split, you'll notice that you can specify train_size or test_size either as a float (a proportion of the whole dataset) or as an int (an exact number of data points to include). In your case you can simply pass test_size=45 to always take exactly 45 data points for the test set.
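For example, a minimal sketch with dummy arrays standing in for your real features and labels. Note that shuffle=False is an assumption here: train_test_split shuffles by default, so if your rows are ordered by day and the "test" day sits at the end, you need to disable shuffling to keep those 45 rows together:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 744 days * 45 choices per day = 33480 rows, ordered by day
X = np.arange(33480 * 2).reshape(33480, 2)
y = np.zeros(33480)

# An int test_size takes exactly that many rows; shuffle=False preserves
# the chronological order, so the last day's 45 rows become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=45, shuffle=False
)
print(X_test.shape)  # (45, 2)
```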
Upvotes: 1