Reputation: 2217
I have a CSV dataset file with features and predictions that looks like this:
Feature1 Feature2 Prediction
214 ast 0
222 bbr 0
845 iop 0
110 frn 1
...
I am trying to shuffle the csv file this way:
import csv
import random
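with open("dataset.csv", newline="") as f:
    r = csv.reader(f)
    # First row is the header; the remaining rows are the data.
    header, l = next(r), list(r)
random.shuffle(l)
# On Python 3 the csv module needs text mode with newline="" ("wb" only works on Python 2).
with open("dataset_shuffled.csv", "w", newline="") as f:
    csv.writer(f).writerows([header] + l)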
However, the rows with a prediction of 1 make up only 1% of the full dataset. Since I want to split this dataset into train/test sets, I would like the 1 predictions to be spread evenly across the dataset.
How can I do that during the shuffling?
Upvotes: 0
Views: 132
Reputation: 1118
Instead of reinventing the wheel, you could use a combination of pandas and scikit-learn. In particular, you can read a CSV file into a pandas DataFrame like this:
import pandas
df = pandas.read_csv('your_csv.csv')
At this point you may want to create x (the feature set) and y (the target):
x = df[['Feature1', 'Feature2']]
y = df[['Prediction']]
and use Scikit-Learn to create training and test sets:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
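Note that the call as written splits at random, so with only 1% positives the share of 1 predictions can differ between the two sets. train_test_split also accepts a stratify argument that keeps the class proportions roughly equal in both splits. A minimal sketch, assuming a made-up toy DataFrame in place of your real file (the data and the train_ratio/test_ratio names are only for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real CSV: 1% of rows have Prediction == 1.
df = pd.DataFrame({
    "Feature1": range(1000),
    "Feature2": ["abc"] * 1000,
    "Prediction": [1] * 10 + [0] * 990,
})

x = df[["Feature1", "Feature2"]]
y = df["Prediction"]

# stratify=y asks for the same class proportions in the train and test sets.
# Rows are shuffled as well, since shuffle=True is the default.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42, stratify=y
)

# Share of positive rows in each split; both should be close to 1%.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(train_ratio, test_ratio)
```

With real data you would keep your own read_csv call and only add stratify=y to the split.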
Check the scikit-learn documentation for further details about train_test_split.
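If you would rather stay with the csv and random modules from the question, the same idea can be done by hand: group the rows by label, shuffle each group, and take the same fraction of every group for the test set. A rough sketch, not a drop-in replacement; stratified_split, its parameters, and the toy rows are names I made up for illustration:

```python
import random


def stratified_split(rows, label_index, test_fraction, seed=None):
    """Split rows into (train, test), keeping each label's share roughly equal in both."""
    rng = random.Random(seed)

    # Group the rows by their label value.
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_index], []).append(row)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Take the same fraction of every label for the test set.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so the labels are mixed rather than grouped together.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical toy rows shaped like csv.reader output (all strings), 1% labelled "1".
rows = [[str(i), "abc", "1" if i < 10 else "0"] for i in range(1000)]
train, test = stratified_split(rows, label_index=2, test_fraction=0.33, seed=42)
print(len(train), len(test))
```

In the question's code you would pass the list l as rows and write train and test to separate files, each with the header row first.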
Upvotes: 1