Shuffle and uniformly spread one kind of row in a CSV file in Python

Question

I have a dataset in a CSV file, with features and predictions, that looks like this:

Feature1    Feature2    Prediction
214         ast         0
222         bbr         0
845         iop         0
110         frn         1
...

I am trying to shuffle the csv file this way:

import csv
import random

with open("dataset.csv") as f:
    r = csv.reader(f)
    header, l = next(r), list(r)

random.shuffle(l)

with open("dataset_shuffled.csv", "w", newline="") as f:
    csv.writer(f).writerows([header] + l)

However, the rows with a prediction of 1 make up only 1% of the full dataset. Since I want to split this dataset into train/test sets, I would like the 1 predictions to be spread evenly throughout the shuffled file.

How can I do that during the shuffling?

Pierluigi · Accepted Answer

Instead of reinventing the wheel, you could use a combination of pandas and scikit-learn. In particular, you can read a CSV file into a pandas DataFrame like this:

import pandas
df = pandas.read_csv('your_csv.csv')

At this point you may want to create x (the feature set) and y (the target):

x = df[['Feature1', 'Feature2']]
y = df[['Prediction']]

Then use scikit-learn to create the training and test sets:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
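
Note that the split above is purely random, so with only about 1% of rows labelled 1 the test set can end up with very few of them. train_test_split accepts a stratify argument that keeps the class proportions roughly the same in both sets. A minimal sketch, assuming a small made-up DataFrame in place of your CSV (the data, the 10% share of 1s and test_size=0.25 are all hypothetical, chosen only to give round numbers):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for pandas.read_csv('your_csv.csv'):
# 200 rows, of which 20 (10%) have Prediction == 1.
df = pd.DataFrame({
    "Feature1": range(200),
    "Feature2": ["abc"] * 200,
    "Prediction": [1] * 20 + [0] * 180,
})

x = df[["Feature1", "Feature2"]]
y = df["Prediction"]

# stratify=y asks for the same share of each class in train and test.
# Rows are shuffled by default before splitting.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42, stratify=y
)

# Share of 1s in each part; both should be close to the overall 10%.
train_share = y_train.mean()
test_share = y_test.mean()
```

Comparing train_share and test_share against the overall share of 1s is a quick way to check that the split did what you wanted.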

See the scikit-learn documentation for further details about train_test_split.
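
If you would rather stay with the csv module and really want the 1 rows spread evenly through the shuffled file itself, one possible approach is to shuffle the two kinds of rows separately and then place the rare rows at evenly spaced positions. This is only a sketch: spread_shuffle and its is_minority parameter are hypothetical names invented here, not part of any library, and the demo rows are made up.

```python
import random


def spread_shuffle(rows, is_minority, seed=None):
    """Shuffle rows, placing the minority rows at evenly spaced positions.

    rows must be a list; is_minority is a function returning True for
    the rare rows (hypothetical helper, not a library function).
    """
    rng = random.Random(seed)
    minority = [r for r in rows if is_minority(r)]
    majority = [r for r in rows if not is_minority(r)]
    rng.shuffle(minority)
    rng.shuffle(majority)

    if not minority:
        return majority

    n, m = len(rows), len(minority)
    # Cut the output into m equal-width slices and put one minority row
    # in the middle of each slice.
    targets = {int((k + 0.5) * n / m) for k in range(m)}

    min_iter, maj_iter = iter(minority), iter(majority)
    return [next(min_iter) if i in targets else next(maj_iter)
            for i in range(n)]


# Made-up demo data: 30 rows, 3 of which have prediction "1".
rows = [[str(i), "abc", "1" if i < 3 else "0"] for i in range(30)]
shuffled = spread_shuffle(rows, lambda r: r[-1] == "1", seed=0)
positions = [i for i, r in enumerate(shuffled) if r[-1] == "1"]
```

With the code from the question, you would pass the list l (read with csv.reader) instead of the demo rows and write the result out with csv.writer as before. Any contiguous chunk of the resulting file then contains roughly the same share of 1 rows, but the positions of those rows are fixed rather than random, so this is a different technique from the stratified split offered by scikit-learn.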
