Reputation: 1731
I'm trying to split data into training, validation, and test using numpy and pandas.
I know this works (it's from the sklearn Iris example):
DataFrame['is_train'] = np.random.uniform(0, 1, len(DataFrame)) <= .75
train, test = DataFrame[DataFrame['is_train']==True], DataFrame[DataFrame['is_train']==False]
But how do I do something similar for a range of values, e.g., .33 < x < .66?
This does not work:
DataFrame['segment'] = np.random.uniform(0, 1, len(df))
DataFrame[DataFrame['segment'] > .33 & DataFrame['segment'] < .66]
Finally, if you're aware of a better way, pray tell.
To the best of my knowledge, sklearn's cross_validation.train_test_split() doesn't do three-way splits.
Upvotes: 0
Views: 309
Reputation: 394071
Wrap each condition in parentheses (and note the first comparison should be > for the range you want):
DataFrame[(DataFrame['segment'] > .33) & (DataFrame['segment'] < .66)]
The & operator has higher precedence than the comparison operators, so without parentheses the expression is grouped as .33 & DataFrame['segment'] first: https://docs.python.org/2/reference/expressions.html#operator-precedence
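A minimal sketch of the parenthesized version (the DataFrame, its column, and the seed here are made up for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible draws for this sketch
df = pd.DataFrame({'x': range(100)})  # toy stand-in for the asker's data
df['segment'] = np.random.uniform(0, 1, len(df))

# Parentheses force each comparison to evaluate before &
middle = df[(df['segment'] > .33) & (df['segment'] < .66)]
```

Without the parentheses, Python tries to evaluate .33 & df['segment'] first and raises a TypeError.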
Also, typically one splits the data into multiple folds according to whatever criteria you desire (see http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) and iterates over those folds to test the robustness of the model. In my opinion a single fixed validation set isn't that useful: how do you know how representative that validation set is?
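That said, if you do want a single fixed three-way split, one numpy-only approach (a sketch; the 60/20/20 proportions and toy data are assumptions here) is to shuffle the index and cut it with np.split:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible shuffle for this sketch
df = pd.DataFrame({'x': range(100)})  # toy stand-in for the asker's data

# Shuffle the row labels, then cut at the 60% and 80% marks
shuffled = np.random.permutation(df.index)
cuts = [int(.6 * len(df)), int(.8 * len(df))]
train_idx, val_idx, test_idx = np.split(shuffled, cuts)
train, validate, test = df.loc[train_idx], df.loc[val_idx], df.loc[test_idx]
```

Every row lands in exactly one of the three frames, which avoids the corner case where a row's random draw sits exactly on a boundary of two uniform-threshold masks.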
Upvotes: 1