asmgx

Reputation: 8044

How to split my dataset into test and train sets without repetition?

I am developing a Python script to test an algorithm. I have a dataset that I need to split into 80% for training and 20% for testing. However, I want to save the test set for further analysis, ensuring no overlap with previous test sets.

Although my code works well overall, I encountered one issue: the test dataset sometimes contains records that were already selected in previous test runs due to the random selection process.

By the end of the process, every record should have been tested in one of the runs.

To clarify with an example:

As you can see, the record {6} was selected twice for testing, which I want to avoid.

How can I modify the code to ensure that the 20% test set is chosen randomly each time but excludes any records that were previously selected?

Here is the current code:

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("CustomersInfo.csv")
y = df['CustomerRank']
X = df.drop('CustomerRank', axis=1, errors='ignore')

#-------------------------------------------------------------------
# This is the part that needs to be fixed:
# each run draws an independent random 20%, so test sets can overlap
for RandStat in [11, 22, 33, 44, 55]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RandStat)
#-------------------------------------------------------------------

    clf = XGBClassifier(random_state=RandStat)
    clf.fit(X_train, y_train)
    fnStoreAnalyse(y_train)  # my own helper (not shown here)

Upvotes: -1

Views: 39

Answers (1)

Matt Hall

Reputation: 8152

You are describing k-fold cross-validation, whereas train_test_split is really designed for 'hold out' validation, where a single test set is set aside once. Read Raschka 2018 for the full lowdown on this.
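As a rough illustration of the difference, here is a minimal sketch that replaces the five train_test_split calls with one 5-fold split. It assumes plain KFold with shuffling is enough for this use case; the random arrays are stand-in data (not the CustomersInfo.csv file from the question), and the model fitting is left as a comment.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in data (an assumption): 20 records, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)

# 5 folds, so each test fold holds 20% of the records.
# shuffle=True makes the assignment of records to folds random.
kf = KFold(n_splits=5, shuffle=True, random_state=11)

test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # With a pandas DataFrame, use X.iloc[train_idx] / X.iloc[test_idx] instead.
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Fit the model on (X_train, y_train) and store/analyse the test fold here.
    test_folds.append(test_idx)

# Collect every index that was used for testing across the 5 runs.
all_test = np.concatenate(test_folds)
```

Because the folds partition the data, no record shows up in two test sets, and after the fifth run every record has been tested once.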

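In k-fold splitting, each sample already lands in exactly one test 'fold'. If you also want to decide yourself which samples end up together in a fold, you need groups, e.g. as implemented in GroupKFold.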

This way, you can assign the samples to groups yourself, however you like, then use those to split the data for cross-validation, for example using sklearn.model_selection.cross_val_score() or sklearn.model_selection.cross_validate(), as described in the User Guide.
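A hedged sketch of that group-based route follows. The data, the LogisticRegression model (swapped in for XGBClassifier to keep the example light), and the way the groups array is built are all assumptions made for illustration; any integer label per record would work as a group.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Stand-in data (an assumption): 50 records, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(int)

# Hypothetical grouping: randomly assign 10 records to each of 5 groups.
groups = rng.permutation(np.arange(50) % 5)

# With 5 groups and 5 splits, each group serves as the test fold once.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(), X, y, groups=groups, cv=cv)

# Inspect which group(s) ended up in each test fold.
test_groups = [set(groups[test_idx]) for _, test_idx in cv.split(X, y, groups)]
```

Here scores holds one accuracy value per fold, and test_groups confirms that no group is tested twice.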

Upvotes: 0
