Reputation: 53
How to split a dataset (CSV) into training and test data in Python if it has no dependent variable?
The project I am currently working on is machine learning based, and the dataset does not contain a dependent variable. The following code works only if the dataset contains a dependent variable:
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.2, random_state = 0)
I expect the split to happen without any y variable.
Is it possible?
Upvotes: 1
Views: 1399
Reputation: 187
To split the dataset into train and test sets, you can shuffle the entire dataset first and then slice it according to the required sizes.
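import pandas as pd
df = pd.read_csv("data.csv")  # hypothetical file name; load your own CSV here
shuffled = df.sample(frac=1)  # frac=1 returns every row, in random order
train_size = int(0.8 * len(df))  # 80% of the rows go to training
train = shuffled[:train_size]
test = shuffled[train_size:]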
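As an alternative to slicing by hand, train_test_split does not strictly need a y: it accepts any number of arrays and returns a train/test pair for each one, so passing only the features should give back just two pieces. A minimal sketch, where the small DataFrame is made up for illustration and stands in for the data loaded from the CSV:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data standing in for the CSV contents (assumption for illustration).
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

# With a single input, only two outputs come back: the train and test parts.
train, test = train_test_split(df, test_size=0.2, random_state=0)

print(len(train), len(test))
```

Setting random_state makes the split reproducible between runs, which the manual shuffle does not do unless a seed is passed to sample as well.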
Upvotes: 0
Reputation: 26
There are two kinds of "random" split: 1) a purely random one, and 2) one that is random but keeps the data equally distributed across the subsets (i.e. similar means / norms).
To answer your question, I would first recommend using a package for managing your data as data frames (e.g. pandas).
See the documentation for details: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
So, if you wanted to get a random 50% sample of the DataFrame with replacement:
df.sample(frac=0.5, replace=True, random_state=1)
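Note that sampling with replacement can pick the same row more than once, so for a train/test split it is probably safer to sample without replacement (the default) and use the rows that were not picked as the test set. A rough sketch, assuming the DataFrame has a unique index and using made-up data in place of the CSV:

```python
import pandas as pd

# Made-up data standing in for the CSV contents (assumption for illustration).
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

# Draw 80% of the rows without replacement for training.
train = df.sample(frac=0.8, random_state=1)

# Everything not drawn becomes the test set; this relies on unique index labels.
test = df.drop(train.index)

print(len(train), len(test))
```

If the index contains duplicate labels, drop would remove every row sharing a label, so calling df.reset_index(drop=True) first may be needed.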
Upvotes: 1