Selecting rows based on range of vector values in Pandas

Question

I'm trying to split data into training, validation, and test using numpy and pandas.

I know this works (it's from the sklearn Iris example):

DataFrame['is_train'] = np.random.uniform(0, 1, len(DataFrame)) <= .75
train, test = DataFrame[DataFrame['is_train']], DataFrame[~DataFrame['is_train']]

But how do I do something similar for a range of values, e.g., .33 < x < .66?

This does not work:

DataFrame['segment'] = np.random.uniform(0, 1, len(DataFrame))
DataFrame[DataFrame['segment'] > .33 & DataFrame['segment'] < .66]

Finally, if you're aware of a better way, pray tell.

To the best of my knowledge, sklearn's cross_validation.train_test_split() doesn't do three-way splits.

EdChum · Accepted Answer

Wrap the conditions in parentheses:

DataFrame[(DataFrame['segment'] > .33) & (DataFrame['segment'] < .66)]

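The & operator has higher precedence than the comparison operators, so without parentheses Python tries to evaluate .33 & DataFrame['segment'] first: https://docs.python.org/2/reference/expressions.html#operator-precedence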
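As a minimal sketch of the parenthesized form applied to a three-way split: the frame `df`, its `value` column, the seed, and the names `train`, `validation`, and `test` are made up for illustration and are not from the question. The `query` call at the end is an alternative spelling of the same range filter, which some find easier to read.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the 'segment' column mirrors the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": np.arange(1000)})
df["segment"] = rng.uniform(0, 1, len(df))

# Each comparison is parenthesized so it is evaluated before &.
in_range = (df["segment"] > .33) & (df["segment"] < .66)

train = df[df["segment"] <= .33]
validation = df[in_range]
test = df[df["segment"] >= .66]

# Same range filter written as a chained comparison inside query().
validation_q = df.query("0.33 < segment < 0.66")
```

The three frames are disjoint and together cover every row, because each row's `segment` value falls into exactly one of the three intervals.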

Also, one typically splits the data several times according to whatever criteria you desire (http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) and iterates over those splits to test the robustness of the model. A fixed validation set is not that useful IMO: how do you know how representative that validation set is?
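To illustrate the idea of iterating over several splits, here is a rough sketch in plain NumPy. It is not scikit-learn's API: the helper `k_fold_indices`, the fold count, and the toy frame are all hypothetical, chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, validation_idx) pairs for k shuffled folds."""
    rng = np.random.default_rng(seed)
    # Shuffle the row positions once, then cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(n_rows), k)
    for i in range(k):
        # Every fold except the i-th forms the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]

# Hypothetical data for illustration.
df = pd.DataFrame({"value": np.arange(10)})
splits = list(k_fold_indices(len(df), k=5))

for train_idx, val_idx in splits:
    train, validation = df.iloc[train_idx], df.iloc[val_idx]
    # Fit on `train` and score on `validation` here.
```

Each row lands in the validation set exactly once, so the scores across folds give some sense of how much the result depends on which rows were held out.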
