Handle labels not present in train data. KNN sklearn

Question

I'm learning KNN and have run into an issue with sklearn.preprocessing.LabelEncoder:

ValueError: y contains previously unseen labels: "F"

I believe it's caused by how I split the train/test data. Some of the test data ends up containing values that are not present in the train data.

I would like to ensure that calling leBrand.transform(["F"]) (where "F" was not present in the train data) replaces "F" with a generic value, say "Unknown".

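import pandas as pd
import sklearn.model_selection
from sklearn.neighbors import KNeighborsClassifier

# x, y and leBrand (a LabelEncoder fitted on the training brands) are defined earlier (not shown)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.1)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(x_train, y_train)

# read in the new data to be predicted
data = pd.read_csv("wso-cats-to-predict.csv")

# fails here when data["brand"] holds a value the encoder never saw
x = pd.DataFrame(data={"Brand": leBrand.transform(data["brand"])})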

data["brand"] contains an "F" that was not present in the train data. This throws the error noted above.

I've tried manipulating the array in various ways. If possible, I would rather map any unknown labels to a single value.

Jan K · Accepted Answer

I would suggest stratifying in the train_test_split function:

sklearn.model_selection.train_test_split(x, y, test_size=0.1, stratify=y)

This will ensure that both the train and test sets have approximately the same distribution of labels. Therefore you should never be in a position where new labels show up at inference time.
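Note that stratify only affects how the existing rows are split. A value such as "F" in a separately loaded file can still be new to the encoder. As a minimal sketch of the fallback the question asks for (this is not part of the answer above), one option is to fit the encoder on the training brands plus a placeholder and map anything unseen to that placeholder before transforming. The brand values, the "Unknown" placeholder and the helper name encode_with_fallback are assumptions made for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "Unknown"  # assumed placeholder for brands never seen in training

# Hypothetical training brands; the real ones come from the asker's data.
train_brands = pd.Series(["A", "B", "C", "A", "B"])

# Fit on the training brands plus the placeholder so it gets its own code.
leBrand = LabelEncoder()
leBrand.fit(pd.concat([train_brands, pd.Series([UNKNOWN])]))


def encode_with_fallback(encoder, values, fallback=UNKNOWN):
    """Replace values the encoder has not seen with `fallback`, then transform."""
    values = pd.Series(values).astype(str)
    known = set(encoder.classes_)
    # Keep known values, substitute the placeholder everywhere else.
    cleaned = values.where(values.isin(known), fallback)
    return encoder.transform(cleaned)


# "F" was not in the training brands, so it is encoded as the placeholder.
new_brands = pd.Series(["A", "F", "C"])
encoded = encode_with_fallback(leBrand, new_brands)
decoded = leBrand.inverse_transform(encoded)
```

With this in place, the failing line from the question could call encode_with_fallback(leBrand, data["brand"]) instead of leBrand.transform(data["brand"]). Keep in mind that KNN treats the integer codes as distances, so all unknown brands collapse onto a single arbitrary point in feature space.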
