Reputation: 2884
I was trying to split the sample dataset using Scikit-learn's StratifiedShuffleSplit. I followed the example shown in the Scikit-learn documentation here.
import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# separate target variable from dataset
target = wine['quality']
data = wine.drop('quality',axis = 1)
# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(target, n_iter=3, test_size=0.2)
for train_index, test_index in sss:
    xtrain, xtest = data[train_index], data[test_index]
    ytrain, ytest = target[train_index], target[test_index]
# Check target series for distribution of classes
ytrain.value_counts()
ytest.value_counts()
However, upon running this script, I get the following error:
IndexError: indices are out-of-bounds
Could someone please point out what I am doing wrong here? Thanks!
Upvotes: 24
Views: 12836
Reputation: 30561
You're running into the different conventions for Pandas DataFrame indexing versus NumPy ndarray indexing. The arrays train_index and test_index are collections of row indices. But data is a Pandas DataFrame object, and when you use a single index into that object, as in data[train_index], Pandas expects train_index to contain column labels rather than row indices. You can either convert the dataframe to a NumPy array, using .values:
data_array = data.values
for train_index, test_index in sss:
    xtrain, xtest = data_array[train_index], data_array[test_index]
    ytrain, ytest = target[train_index], target[test_index]
or use the Pandas .iloc accessor:
for train_index, test_index in sss:
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target[train_index], target[test_index]
I tend to favour the second approach, since it gives xtrain and xtest of type DataFrame rather than ndarray, and so keeps the column labels.
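A side note: the sklearn.cross_validation module used in the question was deprecated and later removed in newer scikit-learn releases. Below is a minimal sketch of the same stratified split against the sklearn.model_selection API (assuming scikit-learn 0.18 or later), where the splitter is configured first and .split() is then called with the features and the stratification target:
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
# Reuse the wine dataset from the question
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
target = wine['quality']
data = wine.drop('quality', axis=1)
# Configure the splitter, then ask it for index arrays via .split()
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(data, target):
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target.iloc[train_index], target.iloc[test_index]
# The class proportions should be roughly the same in both splits
print(ytrain.value_counts(normalize=True))
print(ytest.value_counts(normalize=True))
The indexing issue is the same here: .iloc treats train_index and test_index as row positions, which is what StratifiedShuffleSplit returns.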
Upvotes: 47