Reputation: 53
I'm totally new to Data Science in general and was hoping someone could explain why this does not work:
I'm using the Advertising dataset from the following url: "http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv" which has 3 feature columns ("TV", "Radio", "Newspaper") and 1 label column ("sales"). My complete dataset is named data
.
Next, I try to use sklearn's StratifiedShuffleSplit
function to divide the data into training and testing sets.
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, random_state=0) # can use test_size=0.8
for train_index, test_index in split.split(data.drop("sales", axis=1), data["sales"]): # Generate indices to split data into training and test set.
strat_train_set = data.loc[train_index]
strat_test_set = data.loc[test_index]
I get this ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
Using the same code on another dataset which has 14 feature columns and 1 label column separates the data appropriately. Why doesn't it work here? Thanks.
Upvotes: 1
Views: 1769
Reputation: 2520
I think that problem is your data_y is 2D matrix.
but as I see in sklearn.model_selection.StratifiedShuffleSplit doc
, it should be the 1D
vector. Try to encode each row of data_y as the integer (it will be interpreted as a class), and after use split.
Or possibly your y is a regression variable (continuous numerical data).(Vivek's link)
Upvotes: 1