Error when using one-hot encoded labels in sklearn GaussianNB

Question

I have a dataset:

[['s002'   ... 0.3509 0.2171 0.0742]
 ['s002'   ... 0.2756 0.1917 0.0747]
 ['s002'   ... 0.2847 0.1762 0.0945]
 ...
 ['s057'   ... 0.2017 0.0983 0.0905]
 ['s057'   ... 0.1917 0.0938 0.0931]
 ['s057'   ... 0.1993 0.1186 0.1018]]

's002' to 's057' are the labels (Y)

I am reading the dataset using pandas:

data = pd.read_csv('data.csv').values

Then I prepare the inputs and outputs:

# preparing inputs
X = []
for i in range(0, len(data)):
    X.append(data[i][3:])

# preparing outputs
y = []
for i in range(0, len(data)):
    y.append([data[i][0]])

I am also using OneHotEncoder:

# one hot encoding
enc = OneHotEncoder()
enc.fit(y)
y = enc.transform(y).toarray()

After all this, I split and convert the data:

# splitting data -> train 70%, test 15%, validation 15% (total 20400)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, test_size=0.15,
                                                    random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=0.17645,
                                                  random_state=1)

# converting list to ndarray and converting datatypes
# (np.float was removed in NumPy 1.24; the builtin float is equivalent)
X_train = np.asarray(X_train, dtype=float)
X_test = np.asarray(X_test, dtype=float)
X_val = np.asarray(X_val, dtype=float)
y_train = np.asarray(y_train, dtype=np.uint8)
y_test = np.asarray(y_test, dtype=np.uint8)
y_val = np.asarray(y_val, dtype=np.uint8)

I can use one-hot encoded labels with neural networks and KNN without any errors.

Here is my KNN classification code:

# create model
model = KNeighborsClassifier(metric="manhattan", n_neighbors=1)

# training
model.fit(X_train, y_train)

# testing
y_pred = model.predict(X_test)

print(">>> Accuracy Score (%)")
print(accuracy_score(y_test, y_pred, normalize=False) / len(y_test) * 100, '\n')

print(">>> Classification Report")
print(classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1)))

But when I use one-hot encoded labels with GaussianNB, I get ValueError: bad input shape (14280, 51).

Here is the code:

# create model
model = GaussianNB()

# training
model.fit(X_train, y_train)

The output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in ()
      2 
      3 # training
----> 4 model.fit(X_train, y_train)
      5 
      6 # testing

1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    795         return np.ravel(y)
    796 
--> 797     raise ValueError("bad input shape {0}".format(shape))
    798 
    799 

ValueError: bad input shape (14280, 51)

I couldn't figure out why I am getting this error.

I can use GaussianNB if I invert the one-hot encoding before creating the model:

# inverse one hot encoding
y_train = enc.inverse_transform(y_train)
y_test = enc.inverse_transform(y_test)

But then I get a data conversion warning and 67% accuracy, whereas the other models reach 80%:

>>> Accuracy Score (%)
67.25490196078432 

>>> Classification Report
              precision    recall  f1-score   support

        s002       0.22      0.34      0.27        71
        s003       0.75      0.74      0.74        57
        s004       0.61      0.74      0.67        54
        s005       0.60      0.74      0.66        62
        s007       0.53      0.79      0.64        63
        s008       0.37      0.74      0.50        66
        s010       0.87      0.93      0.90        56
        s011       0.64      0.82      0.72        60
        s012       0.62      0.76      0.68        62
        s013       0.63      0.80      0.70        59
        s015       0.67      0.62      0.65        56
        s016       0.56      0.68      0.62        53
        s017       0.83      0.80      0.81        54
        s018       0.75      0.53      0.62        62
        s019       0.90      0.83      0.87        66
        s020       0.60      0.25      0.35        61
        s021       0.58      0.50      0.54        50
        s022       0.90      0.99      0.94        76
        s024       0.86      0.75      0.80        51
        s025       0.82      0.90      0.86        50
        s026       0.93      0.76      0.84        68
        s027       0.83      0.72      0.77        75
        s028       0.84      0.88      0.86        49
        s029       0.78      0.77      0.77        69
        s030       0.79      0.77      0.78        62
        s031       0.31      0.23      0.26        66
        s032       0.26      0.08      0.12        63
        s033       0.71      0.96      0.82        55
        s034       0.72      0.34      0.46        67
        s035       0.85      0.42      0.56        67
        s036       1.00      0.98      0.99        61
        s037       0.59      0.42      0.49        64
        s038       0.64      0.45      0.53        64
        s039       0.93      0.49      0.64        55
        s040       0.80      0.71      0.75        62
        s041       0.70      0.62      0.66        50
        s042       0.97      0.91      0.94        64
        s043       1.00      0.90      0.94        67
        s044       0.71      0.80      0.75        50
        s046       0.40      0.33      0.36        55
        s047       0.40      0.56      0.47        54
        s048       0.45      0.72      0.56        54
        s049       0.65      0.46      0.53        68
        s050       0.57      0.55      0.56        53
        s051       0.52      0.76      0.62        54
        s052       0.98      0.93      0.95        57
        s053       0.98      0.89      0.93        55
        s054       0.50      0.71      0.58        70
        s055       0.98      0.85      0.91        62
        s056       0.52      0.65      0.58        49
        s057       0.74      0.60      0.66        62

    accuracy                           0.67      3060
   macro avg       0.69      0.68      0.67      3060
weighted avg       0.69      0.67      0.67      3060

/usr/local/lib/python3.6/dist-packages/sklearn/naive_bayes.py:206: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Can I use one-hot encoded labels in sklearn GaussianNB? Where am I making a mistake? What is the solution?

Thank you for your help!

Zabir Al Nazi Nabil · Accepted Answer

Because fit expects a 1-D array of class labels with shape (n_samples,), not one-hot encoded labels.

Just remove this part:

# one hot encoding
enc = OneHotEncoder()
enc.fit(y)
y = enc.transform(y).toarray()
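
As a rough sketch of what that looks like, the following fits GaussianNB on a flat, 1-D array of string labels and then shows that a one-hot encoded label matrix is rejected with a ValueError. The data is synthetic and the names rng, X_demo and y_demo are made up for illustration; none of it comes from the question's dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (an assumption, not the question's dataset):
# three classes whose features are centred on well-separated means.
rng = np.random.default_rng(0)
labels = np.array(["s002", "s003", "s004"])
y_demo = np.repeat(labels, 50)                        # shape (150,)
centers = np.repeat(np.array([0.0, 5.0, 10.0]), 50)   # one mean per sample
X_demo = centers[:, None] + rng.normal(size=(150, 4))

# GaussianNB accepts the 1-D label array directly, strings included.
model = GaussianNB()
model.fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# A one-hot encoded label matrix of shape (150, 3) is rejected by fit.
y_onehot = OneHotEncoder().fit_transform(y_demo.reshape(-1, 1)).toarray()
try:
    GaussianNB().fit(X_demo, y_onehot)
    onehot_rejected = False
except ValueError:
    onehot_rejected = True
```

The predictions come back as the original string labels, so no decoding step is needed before calling classification_report.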

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

fit(self, X, y, sample_weight=None)

    Fit Gaussian Naive Bayes according to X, y

    Parameters

        X : array-like, shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples and n_features is the number of features.

        y : array-like, shape (n_samples,)
            Target values.

        sample_weight : array-like, shape (n_samples,), optional (default=None)
            Weights applied to individual samples (1. for unweighted).
            New in version 0.17: Gaussian Naive Bayes supports fitting with sample_weight.

    Returns

        self : object
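
If the labels have already been one-hot encoded, one way to get back to the (n_samples,) shape that fit documents for y is to invert the encoding and flatten the result. This is a minimal sketch on a toy label list; y_raw, y_column and y_flat are illustrative names, not part of the question's code. inverse_transform returns a column of shape (n_samples, 1), and ravel() flattens it, which is what the DataConversionWarning in the question suggests.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy labels in the same column layout the question builds: [[label], ...]
y_raw = [["s002"], ["s003"], ["s002"], ["s057"]]

enc = OneHotEncoder()
y_onehot = enc.fit_transform(y_raw).toarray()   # shape (4, 3)

# inverse_transform gives back a column vector of shape (4, 1) ...
y_column = enc.inverse_transform(y_onehot)
# ... and ravel() flattens it to the 1-D shape (4,) that fit expects.
y_flat = y_column.ravel()
```

Passing y_flat to model.fit instead of y_column should avoid the column-vector warning, since the array is already 1-D.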
