vivek

Reputation: 583

Unable to create a test and training set using sklearn

Here's the code I've been working on.

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
housing_data = load_boston()

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing_data, test_size = 0.2, random_state = 42)

And I get this error.

/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
   2057 
   2058     return list(chain.from_iterable((safe_indexing(a, train),
-> 2059                                      safe_indexing(a, test)) for a in arrays))
   2060 
   2061 

/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in <genexpr>(.0)
   2057 
   2058     return list(chain.from_iterable((safe_indexing(a, train),
-> 2059                                      safe_indexing(a, test)) for a in arrays))
   2060 
   2061 

/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py in safe_indexing(X, indices)
    162             return X[indices]
    163     else:
--> 164         return [X[idx] for idx in indices]
    165 
    166 

/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py in <listcomp>(.0)
    162             return X[indices]
    163     else:
--> 164         return [X[idx] for idx in indices]
    165 
    166 

KeyError: 3

Upvotes: 1

Views: 222

Answers (1)

Mihai Chelaru

Reputation: 8187

If you look at the documentation for load_boston(), you'll see it returns a Bunch object, not an array. A Bunch is dictionary-like, so train_test_split ends up indexing it with integer positions (the X[idx] in your traceback), and since it has no key 3 you get KeyError: 3. If you inspect the object in Spyder's variable explorer, you can see it contains a description, the actual data (the features you make your predictions from), the names of those features, and the target vector containing the value you're trying to predict.
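
To make the failure concrete, here is a rough sketch of what goes wrong, using a plain dict as a stand-in for the Bunch. The dict contents and the helper name index_by_position are made up for illustration; only the list comprehension mirrors the line shown in the traceback.

```python
import random

# Hypothetical stand-in for the Bunch: a dict with string keys only.
bunch_like = {
    "data": [[0.1, 0.2]],
    "target": [24.0],
    "feature_names": ["CRIM", "ZN"],
    "DESCR": "...",
}

# len() on a dict counts its keys, not the rows of the data array.
n_samples = len(bunch_like)

# Shuffle the positions 0..n_samples-1, roughly as a random split would.
indices = list(range(n_samples))
random.Random(42).shuffle(indices)


def index_by_position(container, positions):
    # Same shape as the fallback in the traceback: [X[idx] for idx in indices]
    return [container[idx] for idx in positions]


try:
    index_by_position(bunch_like, indices)
    error = None
except KeyError as exc:
    # No integer keys exist, so the very first lookup fails.
    error = exc

print(repr(error))
```

Passing the array itself (the value stored under "data") avoids the problem, because arrays and lists can be indexed by position.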

If you only want to split the data portion (the features used for prediction), run the following instead:

train_set, test_set = train_test_split(housing_data.data, test_size = 0.2, random_state = 42)

Alternatively, you can create training and test sets for both X and y (features and target) with the following:

X_train, X_test, y_train, y_test = train_test_split(housing_data.data, housing_data.target, test_size = 0.2, random_state = 42)

Which yields four variables: X_train, X_test, y_train and y_test.

[Image: boston train_test_split]
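
As a quick sanity check on the X/y split, the sketch below builds a Bunch by hand with random data in the same shape as the Boston set (506 rows, 13 features) and confirms that each target stays paired with its feature row. The random data and the name housing_like are assumptions for illustration, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import Bunch

# Synthetic stand-in with the same shape as the Boston data (assumption).
rng = np.random.default_rng(0)
X_all = rng.normal(size=(506, 13))
# Derive the target from the first feature so the pairing can be verified later.
y_all = X_all[:, 0] * 2.0
housing_like = Bunch(data=X_all, target=y_all)

# Pass the arrays, not the Bunch itself.
X_train, X_test, y_train, y_test = train_test_split(
    housing_like.data, housing_like.target, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Rows and targets are shuffled together, so the relationship still holds.
aligned = bool(
    np.allclose(y_train, X_train[:, 0] * 2.0)
    and np.allclose(y_test, X_test[:, 0] * 2.0)
)
print(aligned)
```

With test_size = 0.2 on 506 rows, the test set gets 102 rows and the training set the remaining 404.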

Edit: If you call load_boston() with the return_X_y = True parameter, it returns a tuple of (data, target), allowing you to do the following, which is arguably more elegant:

X, y = load_boston(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
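
Note that load_boston() was removed in scikit-learn 1.2, so the import fails on current versions. The same pattern works with any bundled loader that accepts return_X_y; the sketch below uses load_diabetes purely as a stand-in dataset, which is my substitution and not part of the original question.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# load_diabetes ships with scikit-learn, so nothing is downloaded.
# It is only a stand-in here for the removed load_boston.
X, y = load_diabetes(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Features and targets keep matching row counts on both sides of the split.
print(X.shape, y.shape)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```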

Upvotes: 4
