Dumbo The Elephant

Reputation: 31

Split dataset without using Scikit-Learn train_test_split

I would like to split my dataset into training and test sets without using the sklearn library. Below are my current code and what I have tried so far.

My current code:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

What I tried:

def non_shuffling_train_test_split(X, y, test_size=0.2):
    i = int((1 - test_size) * X.shape[0]) + 1
    X_train, X_test = np.split(X, [i])
    y_train, y_test = np.split(y, [i])
    return X_train, X_test, y_train, y_test

However, the code above does not shuffle the data: the first rows always go to the training set and the last rows to the test set.

Upvotes: 0

Views: 2155

Answers (1)

StupidWolf

Reputation: 46978

You can create a shuffled order with np.random.permutation and then reorder the rows with np.take before splitting. This should work on both NumPy arrays and pandas DataFrames:

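import numpy as np

def tt_split(X, y, test_size=0.2):
    # Position of the first test row, e.g. 40 for 50 rows and test_size=0.2
    i = int((1 - test_size) * X.shape[0])
    # Random ordering of the row positions, shared by X and y
    o = np.random.permutation(X.shape[0])

    # Reorder rows with the same permutation, then cut at position i
    X_train, X_test = np.split(np.take(X, o, axis=0), [i])
    y_train, y_test = np.split(np.take(y, o), [i])
    return X_train, X_test, y_train, y_test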
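
The key point is that X and y are reordered with the same permutation, so each row keeps its own label. A minimal sketch of just that step, using a small made-up array (the toy data and the names X_shuf and y_shuf are only for illustration):

```python
import numpy as np

# Toy data: row k of X starts with 2*k, and its label is y[k] = 10*k
X = np.arange(10).reshape(5, 2)
y = np.arange(5) * 10

# One permutation of the row positions, applied to both X and y
o = np.random.permutation(X.shape[0])
X_shuf = np.take(X, o, axis=0)
y_shuf = np.take(y, o)

# Whatever the random order is, every row is still paired with its label
print(np.array_equal(X_shuf[:, 0] * 5, y_shuf))  # True
```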

Test it on numpy array:

X = np.random.normal(0,1,(50,10))
y = np.random.normal(0,1,(50,))
X_train, X_test, y_train, y_test = tt_split(X,y)
[X_train.shape,y_train.shape]
[(40, 10), (40,)]

Test it on pandas data frame:

import pandas as pd
X = pd.DataFrame(np.random.normal(0,1,(50,10)))
y = pd.Series(np.random.normal(0,1,50))
X_train, X_test, y_train, y_test = tt_split(X,y)
[X_train.shape,y_train.shape]
[(40, 10), (40,)]
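
If you also want something like random_state from the question, one option is to draw the permutation from a seeded generator. The sketch below is a variation on the function above, not a drop-in replacement for sklearn: the name seeded_split and the seed parameter are made up here, it slices the permutation instead of calling np.split, and the same seed will not give the same rows as train_test_split(random_state=1234). It is only exercised on NumPy arrays here.

```python
import numpy as np

def seeded_split(X, y, test_size=0.2, seed=None):
    """Shuffle and split X and y; passing a seed makes the split repeatable."""
    n = len(X)
    n_train = int((1 - test_size) * n)
    # A seeded generator returns the same permutation on every call
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    train_idx, test_idx = order[:n_train], order[n_train:]
    # Select rows by position; the same indices are used for X and y
    return (np.take(X, train_idx, axis=0), np.take(X, test_idx, axis=0),
            np.take(y, train_idx, axis=0), np.take(y, test_idx, axis=0))

X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_train, X_test, y_train, y_test = seeded_split(X, y, seed=1234)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```

Calling seeded_split again with the same seed returns the same four pieces, while seed=None gives a different shuffle on each call.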

Upvotes: 3
