Reputation: 31
I would like to split my dataset without using the sklearn library. Below are the methods I've used.
My current code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
What I tried:
def non_shuffling_train_test_split(X, y, test_size=0.2):
    i = int((1 - test_size) * X.shape[0]) + 1
    X_train, X_test = np.split(X, [i])
    y_train, y_test = np.split(y, [i])
    return X_train, X_test, y_train, y_test
However, the code above is not randomized.
Upvotes: 0
Views: 2155
Reputation: 46978
You can create a shuffled order using np.random.permutation and then subset using np.take. This should work on both NumPy arrays and pandas DataFrames:
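import numpy as np

def tt_split(X, y, test_size=0.2):
    # Number of rows that go into the training set
    i = int((1 - test_size) * X.shape[0])
    # Random ordering of the row positions 0..n-1
    o = np.random.permutation(X.shape[0])
    # Reorder X and y with the same permutation, then cut at position i
    X_train, X_test = np.split(np.take(X, o, axis=0), [i])
    y_train, y_test = np.split(np.take(y, o), [i])
    return X_train, X_test, y_train, y_test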
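The function above gives a different split on every call, whereas the original train_test_split call used random_state=1234. If you need a reproducible split, one option is to draw the permutation from a seeded generator. This is only a sketch for NumPy inputs: the name seeded_tt_split and the seed parameter are made up for illustration, and the same seed will not reproduce the exact rows that sklearn would pick.

```python
import numpy as np

def seeded_tt_split(X, y, test_size=0.2, seed=None):
    # Hypothetical helper: same idea as tt_split, but reproducible via `seed`.
    X = np.asarray(X)
    y = np.asarray(y)
    n = X.shape[0]
    n_train = int((1 - test_size) * n)
    # A seeded Generator yields the same permutation for the same seed
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    train_idx, test_idx = order[:n_train], order[n_train:]
    # Index X and y with the same positions so rows stay paired
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_train, X_test, y_train, y_test = seeded_tt_split(X, y, seed=1234)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Calling it twice with the same seed should return identical arrays, and passing seed=None falls back to a fresh random split each time.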
Test it on a NumPy array:
X = np.random.normal(0,1,(50,10))
y = np.random.normal(0,1,(50,))
X_train, X_test, y_train, y_test = tt_split(X,y)
[X_train.shape,y_train.shape]
[(40, 10), (40,)]
Test it on a pandas DataFrame:
X = pd.DataFrame(np.random.normal(0,1,(50,10)))
y = pd.Series(np.random.normal(0,1,50))
X_train, X_test, y_train, y_test = tt_split(X,y)
[X_train.shape,y_train.shape]
[(40, 10), (40,)]
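If you would rather not pass pandas objects through NumPy functions, a possible alternative is to shuffle row positions and select with .iloc, which keeps the results as a DataFrame and a Series with their original index labels. This is a sketch under assumptions: tt_split_pandas and its seed parameter are hypothetical names, and X and y are assumed to have the same number of rows.

```python
import numpy as np
import pandas as pd

def tt_split_pandas(X, y, test_size=0.2, seed=None):
    # Hypothetical helper: positional split for a DataFrame X and a Series y.
    n = len(X)
    n_train = int((1 - test_size) * n)
    order = np.random.default_rng(seed).permutation(n)
    train_pos, test_pos = order[:n_train], order[n_train:]
    # .iloc selects by integer position, so X and y stay row-aligned
    return X.iloc[train_pos], X.iloc[test_pos], y.iloc[train_pos], y.iloc[test_pos]

X = pd.DataFrame(np.random.normal(0, 1, (50, 10)))
y = pd.Series(np.random.normal(0, 1, 50))
X_train, X_test, y_train, y_test = tt_split_pandas(X, y, seed=1234)
print(X_train.shape, y_train.shape)
print(X_train.index.equals(y_train.index))
```

Because the index labels are preserved, you can check afterwards which original rows landed in each part, for example with X_test.index.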
Upvotes: 3