How to split a dataset (CSV) into training and test data

Question

How can I split a dataset (CSV) into training and test data in Python if it contains no dependent variable?

The project I am currently working on is machine-learning based, and the dataset does not contain a dependent variable. The following code works only if the dataset contains a dependent variable:

from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.2, random_state = 0)

I want the split to happen without any y variable. Is that possible?

Kathleen Allyson Harrison · Accepted Answer

There are two kinds of "random" split: 1) a purely random split, and 2) a split that is random but keeps the distribution of the data roughly equal across the parts (i.e. same means / norms).

To answer your question, I would first recommend using a package for managing your data frames (e.g. pandas).

See the documentation for DataFrame.sample: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html

So, if you wanted to get a random 50% sample of the DataFrame with replacement:

 df.sample(frac=0.5, replace=True, random_state=1)
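
For an actual train/test split you would normally sample without replacement (the default for sample), since sampling with replacement can pick the same row more than once. Below is a minimal sketch of two ways to split without a y variable. The small DataFrame and the names train_df, test_df, train_pd and test_pd are made up for illustration; in practice df would come from pd.read_csv on your own file. The first option relies on train_test_split accepting a single array or DataFrame, in which case it returns only two pieces.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a DataFrame loaded from your CSV file
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

# Option 1: pass only the features to train_test_split.
# With a single input it returns two outputs, so no y is needed.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

# Option 2: pandas only. Sample 80% of the rows without replacement,
# then use the rows that were not sampled as the test set.
train_pd = df.sample(frac=0.8, random_state=1)
test_pd = df.drop(train_pd.index)
```

Both options give a purely random 80/20 split. Option 2 assumes the DataFrame index is unique, because drop removes rows by index label.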
