Reputation: 583
Here's the code I've been working on.
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
housing_data = load_boston()
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing_data, test_size = 0.2, random_state = 42)
And I get this error.
/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
2057
2058 return list(chain.from_iterable((safe_indexing(a, train),
-> 2059 safe_indexing(a, test)) for a in arrays))
2060
2061
/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py in <genexpr>(.0)
2057
2058 return list(chain.from_iterable((safe_indexing(a, train),
-> 2059 safe_indexing(a, test)) for a in arrays))
2060
2061
/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py in safe_indexing(X, indices)
162 return X[indices]
163 else:
--> 164 return [X[idx] for idx in indices]
165
166
/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py in <listcomp>(.0)
162 return X[indices]
163 else:
--> 164 return [X[idx] for idx in indices]
165
166
KeyError: 3
Upvotes: 1
Views: 222
Reputation: 8187
If you look at the documentation for load_boston(), you'll see it returns a Bunch object, which is a dictionary-like container rather than an array. train_test_split tries to index it by row number, and that lookup is what raises the KeyError. If you inspect the object in Spyder's variable explorer, you can see it contains a description, the actual data (the features you make your predictions from), the labels for each of those features, and the target vector containing the value you're trying to predict.
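To make the cause of the KeyError concrete, here is a minimal sketch that reproduces it without scikit-learn. BunchLike, housing_like and the toy values are hypothetical stand-ins I made up for illustration; the only things taken from the traceback are the list comprehension `[X[idx] for idx in indices]` and the fact that the lookup fails with a KeyError.

```python
class BunchLike(dict):
    """Hypothetical stand-in for a Bunch: a dict that also allows attribute access."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Toy container with the same two keys the answer relies on (values are made up).
housing_like = BunchLike(
    data=[[0.1, 2.0], [0.3, 4.0], [0.5, 6.0], [0.7, 8.0]],
    target=[24.0, 21.6, 34.7, 33.4],
)

indices = [3, 0]

# Indexing the container itself with an integer is a dict key lookup,
# and there is no key 3, so this fails the same way the traceback does.
try:
    rows = [housing_like[idx] for idx in indices]
    error = None
except KeyError as exc:
    error = exc

# Indexing the data held inside the container selects rows as intended.
rows = [housing_like.data[idx] for idx in indices]
```

The point of the sketch is only that the container is keyed by names such as data and target, so row indices have to be applied to one of those entries, not to the container.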
If you only want to split the data portion (the features used for prediction), pass housing_data.data instead:
train_set, test_set = train_test_split(housing_data.data, test_size = 0.2, random_state = 42)
Alternatively, you can create training and test sets for both X and y (features and target) with the following:
X_train, X_test, y_train, y_test = train_test_split(housing_data.data, housing_data.target, test_size = 0.2, random_state = 42)
This yields four variables: X_train and X_test hold the feature rows, and y_train and y_test hold the matching target values.
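If it helps to see why passing both arrays in one call keeps features and targets aligned, here is a simplified pure-Python sketch of the idea. It is not scikit-learn's implementation: paired_split is a hypothetical helper, the toy X and y are made up, and its shuffling will not reproduce the exact rows train_test_split picks for random_state = 42. It only mirrors what the traceback suggests, namely that one set of train and test indices is applied to every array passed in.

```python
import math
import random


def paired_split(X, y, test_size=0.2, seed=42):
    """Hypothetical helper: split X and y with one shared set of shuffled indices."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")

    # Shuffle the row indices once so both arrays are split identically.
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)

    # Round the test size up to a whole number of rows.
    n_test = math.ceil(test_size * len(X))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


# Toy data where each label can be recomputed from its row, to check alignment.
X = [[i, i * 10] for i in range(10)]
y = [i * 100 for i in range(10)]

X_train, X_test, y_train, y_test = paired_split(X, y)
```

Because the same indices are used for X and y, every row in X_train still sits next to its own label in y_train, which is what you need before fitting a model.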
Edit: If you call load_boston() with the parameter return_X_y = True, it returns a tuple of (data, target), allowing you to do the following, which is arguably more elegant:
X, y = load_boston(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
Upvotes: 4