jxn

Reputation: 8025

Writing a train_test_split function with NumPy

I am trying to write my own train/test split function using NumPy instead of sklearn's train_test_split function. I want to split the data into 70% training and 30% test. I am using the Boston housing data set from sklearn.

This is the shape of the data:

housing_features.shape  # (506, 13): 506 samples, 13 features

This is my code:

import numpy as np
from sklearn import datasets

city_data = datasets.load_boston()
housing_prices = city_data.target
housing_features = city_data.data

def shuffle_split_data(X, y):
    split = np.random.rand(X.shape[0]) < 0.7

    X_Train = X[split]
    y_Train = y[split]
    X_Test =  X[~split]
    y_Test = y[~split]

    print len(X_Train), len(y_Train), len(X_Test), len(y_Test)
    return X_Train, y_Train, X_Test, y_Test

try:
    X_train, y_train, X_test, y_test = shuffle_split_data(housing_features, housing_prices)
    print "Successful"
except:
    print "Fail"

The printed output I got is:

362 362 144 144
"Successful"

But I know it was not successful, because I get different lengths each time I run it. With sklearn's train_test_split function, the length of X_train is always 354.

#correct output
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_features, housing_prices, test_size=0.3, random_state=42)
print len(X_train) 
#354 

What am I missing in my function?

Upvotes: 2

Views: 12391

Answers (1)

Anton Protopopov

Reputation: 31662

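This happens because np.random.rand draws each value independently, so each row lands in the training set with probability 0.7. The number of training rows is therefore itself random, and it only approaches exactly 70% as the sample gets very large. You could instead use np.percentile to find the 70th-percentile value of the random array and then compare against that value as you did: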

def shuffle_split_data(X, y):
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, 70)

    X_train = X[split]
    y_train = y[split]
    X_test =  X[~split]
    y_test = y[~split]

    print len(X_train), len(y_train), len(X_test), len(y_test)
    return X_train, y_train, X_test, y_test
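
To see the difference between the two masks, here is a small check. It is only a sketch: it uses plain random arrays of length 506 (the row count from the question) rather than the housing data, and the names n_samples, random_counts and percentile_counts are made up for illustration.

```python
import numpy as np

n_samples = 506  # same number of rows as in the question

# Independent draws compared with a fixed 0.7 limit:
# the number of True entries changes from run to run.
random_counts = [
    int((np.random.rand(n_samples) < 0.7).sum()) for _ in range(5)
]

# Comparing with the 70th percentile of the same array:
# the number of True entries is the same on every run.
percentile_counts = []
for _ in range(5):
    arr_rand = np.random.rand(n_samples)
    mask = arr_rand < np.percentile(arr_rand, 70)
    percentile_counts.append(int(mask.sum()))

print(random_counts)
print(percentile_counts)
```

For 506 rows the percentile version should select 354 rows every time, while the first list varies around that number.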

EDIT

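Alternatively, you could use np.random.choice to select exactly the number of training indices you want. Pass replace=False, otherwise the same row can be picked more than once. For your case:

np.random.choice(X.shape[0], int(0.7 * X.shape[0]), replace=False)

The indices that were not selected then form the test set.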
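
Putting the index-based idea into a full function could look like the sketch below. It uses np.random.permutation instead of np.random.choice, since shuffling all indices once gives both the training and the test part. The train_fraction and seed parameters are additions for illustration and not part of the original function, and the arrays are a synthetic stand-in with the same shape as the housing data.

```python
import numpy as np


def shuffle_split_data(X, y, train_fraction=0.7, seed=None):
    """Shuffle row indices once and cut them into train and test parts."""
    rng = np.random.RandomState(seed)  # a fixed seed gives a repeatable split
    n_samples = X.shape[0]
    n_train = int(train_fraction * n_samples)

    indices = rng.permutation(n_samples)  # every row index exactly once
    train_idx = indices[:n_train]
    test_idx = indices[n_train:]

    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


# Synthetic stand-in with the same shape as the housing data (506 x 13).
X = np.arange(506 * 13).reshape(506, 13)
y = np.arange(506)

X_train, y_train, X_test, y_test = shuffle_split_data(X, y, seed=42)
print(len(X_train), len(X_test))
```

With 506 rows this gives 354 training rows and 152 test rows on every run. The rows selected will generally not be the same ones sklearn picks for random_state=42; only the sizes match.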

Upvotes: 4
