Reputation: 3093
I'm learning KNN and have run into an issue with sklearn.preprocessing.LabelEncoder:
ValueError: y contains previously unseen labels: "F"
I believe it's caused by how I split the train/test data: some of the test data ends up containing labels that are not present in the train data.
I would like to ensure that calling leBrand.transform(["F"]) (where "F" was not present in the train data) replaces "F" with a generic value, say "Unknown".
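import pandas as pd
import sklearn.model_selection
from sklearn.neighbors import KNeighborsClassifier

# x, y and the fitted leBrand encoder are defined earlier (not shown)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)
model = KNeighborsClassifier(n_neighbors=1)
model.fit(x_train, y_train)
# read in the new data to be predicted
data = pd.read_csv("wso-cats-to-predict.csv")
x = pd.DataFrame(data={"Brand": leBrand.transform(data["brand"])})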
data["brand"] contains an "F" that was not present in the train data, which throws the error noted above.
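For reference, the error can be reproduced in isolation. This is a minimal sketch with made-up brand values ("A", "B", "C" are placeholders, not the actual data), showing that transform raises as soon as it meets a value that was not passed to fit:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical brand values standing in for the training data.
leBrand = LabelEncoder()
leBrand.fit(["A", "B", "C"])

try:
    # "F" was never seen during fit, so transform raises ValueError.
    leBrand.transform(["A", "F"])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```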
I've tried to manipulate the array in various ways. If possible, I would rather map any unknown labels to a single value.
Upvotes: 0
Views: 527
Reputation: 4150
I would suggest stratifying in the train_test_split function:
sklearn.model_selection.train_test_split(x, y, test_size=0.1, stratify=y)
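This keeps the distribution of labels in y approximately the same in the train and test sets, so a label that appears in the test set should also appear in the train set. Note that this only covers the split itself: data loaded later for prediction can still contain values that were never seen during fitting.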
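If unseen values can still turn up in new data at prediction time, one option is to reserve a placeholder label when fitting the encoder and map anything unfamiliar to it before calling transform. The following is a minimal sketch, not a built-in LabelEncoder feature: the helper names fit_encoder_with_unknown and transform_with_unknown, the "Unknown" placeholder, and the sample brand values are all assumptions made for illustration.

```python
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "Unknown"  # hypothetical placeholder label, chosen for this sketch


def fit_encoder_with_unknown(train_values):
    """Fit a LabelEncoder on the training values plus the placeholder."""
    encoder = LabelEncoder()
    encoder.fit(list(train_values) + [UNKNOWN])
    return encoder


def transform_with_unknown(encoder, values):
    """Replace values the encoder has not seen with the placeholder, then encode."""
    known = set(encoder.classes_)
    cleaned = [v if v in known else UNKNOWN for v in values]
    return encoder.transform(cleaned)


# Made-up brand values; "F" is not part of the training data.
leBrand = fit_encoder_with_unknown(["A", "B", "C"])
encoded = transform_with_unknown(leBrand, ["A", "F", "C"])
decoded = leBrand.inverse_transform(encoded)
print(list(decoded))
```

With a pandas column, the same helper could be called as transform_with_unknown(leBrand, data["brand"]). Bear in mind that the model never sees the placeholder during training unless some training rows are labelled "Unknown" too, so predictions for such rows should be treated with caution.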
Upvotes: 2