Reputation: 306
I have a large dataset (around 200k rows) and I want to split it randomly into two parts: 70% as training data and 30% as testing data. Is there a way to do this in Python? Note that I also want to save the two datasets as Excel or CSV files on my computer. Thanks!
Upvotes: 1
Views: 11686
Reputation: 667
Start by importing the following:
from sklearn.model_selection import train_test_split
import pandas as pd
To split the data, you can use the train_test_split function from the sklearn package:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
where X and y are the feature columns and the target column taken from your original dataframe.
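Afterwards, you can export each of them as CSV using the pandas package. Pass a file name, otherwise to_csv only returns the CSV text as a string and writes nothing to disk:
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)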
Same goes for y data as well.
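To make the X/y step concrete, here is a minimal sketch of the whole flow. It assumes the label lives in a column named `target`; that name, the toy dataframe, and the output file names are all placeholders for illustration, not something taken from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataframe; "target" is a hypothetical label column.
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "target": [0, 1] * 5,
})

X = df.drop(columns=["target"])  # feature columns
y = df["target"]                 # label column

# 70/30 random split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Passing a path makes to_csv write a file instead of returning a string.
X_train.to_csv("X_train.csv", index=False)
X_test.to_csv("X_test.csv", index=False)
y_train.to_csv("y_train.csv", index=False)
y_test.to_csv("y_test.csv", index=False)
```

The rows of X_train and y_train stay aligned because train_test_split shuffles both with the same permutation.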
EDIT: since you clarified the question and need both the X and y columns in the same file, you can split the whole dataframe instead:
train, test = train_test_split(yourdata, test_size=0.3, random_state=42)
and then export each part with to_csv, passing a file name, e.g. train.to_csv('train.csv', index=False).
Upvotes: 0
Reputation: 184
from sklearn.model_selection import train_test_split
# Split the data into train and test sets (70% / 30%)
train, test = train_test_split(data, test_size=0.30, random_state=0)
# Save each part to a CSV file in the current working directory
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
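If you would rather not depend on sklearn, the same 70/30 split can be sketched with pandas alone. This is an alternative technique, not what the snippet above does; the toy dataframe and column names are made up for illustration, and the approach assumes the dataframe index has no duplicate values.

```python
import pandas as pd

# Toy stand-in for the real dataset.
data = pd.DataFrame({"x": range(100), "label": [0, 1] * 50})

# Draw a random 70% of the rows for training; the seed keeps it reproducible.
train = data.sample(frac=0.7, random_state=0)
# Every row that was not drawn becomes the test set (assumes a unique index).
test = data.drop(train.index)

train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```

For Excel output, train.to_excel('train.xlsx', index=False) works the same way, provided an Excel writer engine such as openpyxl is installed.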
Upvotes: 5