John Davis

Reputation: 303

How to divide a data set into train, test and validation sets

My data frame consists of 44.2 billion rows. I want to split it into 3 sets (train, test and validation) so that no rows overlap between them.

What I have tried (1st process):

train, valid, test = np.split(df.sample(frac=1), [int(.8*len(df)), int(.95*len(df))])

Checking whether any id appears in more than one set:

len(valid[valid.id.isin(test.id)])
len(train[train.id.isin(test.id)])

2nd process:

train = df[(np.random.rand(len(df)) < 0.8)]
valid = df[(np.random.rand(len(df)) > 0.8) & (np.random.rand(len(df)) < 0.95)]
test = df[(np.random.rand(len(df)) > 0.95) & (np.random.rand(len(df)) < 1)]

But as far as I understand, neither of the two methods above is correct. Can anybody help me?

Upvotes: 0

Views: 1663

Answers (1)

itsmeavinash

Reputation: 31

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.25, random_state=1)

The first call splits the dataset in an 80:20 ratio. The second call applies the same function to the held-out 20% with test_size=0.25, which gives 15% validation and 5% test relative to the original data.
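If you would rather keep everything in one data frame instead of separate X and y, the same two-step idea can be sketched roughly as below. This is only an illustration: the toy frame and its `value` column are made up, and only the `id` column name comes from the question. It also assumes the data fits in memory, which may not hold for the full data set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the real data.
# Only the `id` column name is taken from the question.
df = pd.DataFrame({"id": np.arange(1000), "value": np.random.rand(1000)})

# Step 1: hold out 20% of the rows.
train, holdout = train_test_split(df, test_size=0.2, random_state=1)

# Step 2: split the held-out 20% in a 75:25 ratio,
# i.e. 15% validation and 5% test of the original rows.
valid, test = train_test_split(holdout, test_size=0.25, random_state=1)

# Every row lands in exactly one set, so the id checks should all be zero.
overlap_train_valid = train["id"].isin(valid["id"]).sum()
overlap_train_test = train["id"].isin(test["id"]).sum()
overlap_valid_test = valid["id"].isin(test["id"]).sum()
print(len(train), len(valid), len(test))
print(overlap_train_valid, overlap_train_test, overlap_valid_test)
```

Note that the 2nd process in the question calls np.random.rand separately for every mask, so the masks are independent of each other and a row can end up in more than one set or in none. Each call to train_test_split partitions its input, which avoids that.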

Upvotes: 1
