Train-test Split of a CSV file in Python

Question

I have a .csv file that contains my data. I would like to do Logistic Regression, Naive Bayes and Decision Trees. I already know how to implement these.

However, my teacher wants me to train my algorithms on 80% of the data in my .csv file and let them predict the other 20%. I would like to know how to actually split the data that way.

import pandas as pd

diabetes_df = pd.read_csv("diabetes.csv")
diabetes_df.head()

with open("diabetes.csv", "rb") as f:
    data = f.read().split()
    train_data = data[:80]
    test_data = data[20:]

I tried to split it like this, but of course it doesn't work.

Martin Thoma · Accepted Answer

Workflow

Load the data (see How do I read and write CSV files with Python?)
Preprocess the data (e.g. filtering / creating new features)
Make the train-test (validation and dev-set) split
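
The first two steps might look roughly like the sketch below. It is only an illustration: the inline CSV text is made-up stand-in data so the snippet runs on its own, and the column names Glucose, BMI and Outcome are assumptions about what diabetes.csv contains. The result is a feature table X and a label column y, which is what the split step expects.

```python
import io

import pandas as pd

# Made-up stand-in for diabetes.csv so the sketch runs on its own.
# With the real file, use pd.read_csv("diabetes.csv") instead.
csv_text = """Glucose,BMI,Outcome
148,33.6,1
85,26.6,0
183,23.3,1
89,28.1,0
137,43.1,1
116,25.6,0
78,31.0,1
115,35.3,0
197,30.5,1
125,0.0,1
"""

# Step 1: load the data.
diabetes_df = pd.read_csv(io.StringIO(csv_text))

# Step 2: preprocess, here by dropping rows with an implausible BMI of 0.
diabetes_df = diabetes_df[diabetes_df["BMI"] > 0]

# Separate the features (X) from the label (y) before making the split.
# "Outcome" is an assumed name for the label column.
X = diabetes_df.drop(columns=["Outcome"])
y = diabetes_df["Outcome"]
```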

Code

scikit-learn's sklearn.model_selection.train_test_split is what you are looking for:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
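
For the 80/20 split asked about in the question, test_size would be 0.2 rather than 0.33. A minimal self-contained sketch follows; the small DataFrame is made-up toy data, and the column names (including the label column "Outcome") are assumptions, so substitute the columns of your own file. The stratify argument is optional and is used here only to keep the class ratio similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data: 10 rows, two features and a binary label.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 27.4],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Features and label; "Outcome" is an assumed label column name.
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

# test_size=0.2 holds out 20% of the rows for testing.
# random_state makes the shuffle reproducible.
# stratify=y keeps the class ratio similar in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(len(X_train), len(X_test))  # 8 2
```

Each model is then fitted on X_train and y_train only, and its predictions for X_test are compared against y_test.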
