kurious

Reputation: 1044

ValueError Inconsistent number of samples error with MultinomialNB

I need to create a model that classifies records accurately based on a variable. For instance, if a record has predictor A or B, I want it to be classified as having predicted value X. The actual data is in this form:

    Predicted    Predictor
      X            A
      X            B
      Y            D
      X            A

For my solution, I did the following:

  1. Used LabelEncoder to create numerical values for the Predicted column.
  2. Parsed the predictor variable's multiple categories into individual columns using get_dummies.
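For reference, the two preprocessing steps can be sketched like this on the toy data from the question (this is an illustration with the example's column names, not the actual code):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Predicted': ['X', 'X', 'Y', 'X'],
                   'Predictor': ['A', 'B', 'D', 'A']})

# Step 1: encode the Predicted column as integers
# (LabelEncoder sorts the classes, so X -> 0, Y -> 1)
df['Predicted'] = LabelEncoder().fit_transform(df['Predicted'])

# Step 2: expand the categorical predictor into one dummy column per category
df = pd.get_dummies(df, columns=['Predictor'])

print(df.columns.tolist())
# ['Predicted', 'Predictor_A', 'Predictor_B', 'Predictor_D']
```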

Here is a sub-section of the dataframe with the (dummy)Predictor and a couple of predictor categories (pardon the misalignment):

    Predicted Predictor_A    Predictor_B
9056    30  0   0
2482    74  1   0
3407    56  1   0
12882   15  0   0
7988    30  0   0
13032   12  0   0
9738    28  0   0
6739    40  0   0
373 131 0   0
3030    62  0   0
8964    30  0   0
691 125 0   0
6214    41  0   0
6438    41  1   0
5060    42  0   0
3703    49  0   0
12461   16  0   0
2235    75  0   0
5107    42  0   0
4464    46  0   0
7075    39  1   0
11891   16  0   0
9190    30  0   0
8312    30  0   0
10328   24  0   0
1602    97  0   0
8804    30  0   0
8286    30  0   0
6821    40  0   0
3953    46  1   

After reshaping the data into the dataframe shown above, I try fitting MultinomialNB from sklearn. When I do, the error I run into is:

ValueError: Found input variables with inconsistent numbers of samples: [1, 8158]

I run into the error even when trying it with a dataframe that has only 2 columns: Predicted and Predictor_A.

My questions are:

  1. What do I need to do to resolve the error?
  2. Is my approach correct?

Upvotes: 1

Views: 177

Answers (1)

seralouk

Reputation: 33147

  • To fit a MultinomialNB model, you need the training samples, their features, and their corresponding labels (target values).

  • In your case, Predicted is the target variable, and Predictor_A and Predictor_B are the features (predictors).
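The error message itself usually means that X and y disagree on the number of samples, e.g. when a feature array is reshaped to a single row. A minimal sketch that reproduces and then avoids it (with made-up data of the same size, not the asker's CSV):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(8158, 1))   # 8158 samples, 1 feature
y = rng.integers(0, 3, size=8158)        # one label per sample

clf = MultinomialNB()
clf.fit(X, y)                            # OK: first dimensions match

try:
    clf.fit(X.reshape(1, -1), y)         # X now looks like a single sample
except ValueError as e:
    print(e)  # Found input variables with inconsistent numbers of samples: [1, 8158]
```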


Example 1:

from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dt.csv", sep=r"\s+")  # delim_whitespace=True is deprecated in recent pandas

# X holds the features
X = df[['Predictor_A', 'Predictor_B']]
# y holds the labels (targets/classes)
y = df['Predicted']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)

clf.predict(X_test)

#array([30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30])

# This result makes sense if you look at X_test: almost all the samples are identical.
print(X_test)

       Predictor_A  Predictor_B
8286             0            0
12461            0            0
6214             0            0
9190             0            0
373              0            0
3030             0            0
11891            0            0
9056             0            0
8804             0            0
6438             1            0

# Get the class probabilities
clf.predict_proba(X_test)
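As a side note (illustrated with toy data rather than the CSV above), predict_proba returns one row per test sample and one column per class, with the columns ordered by clf.classes_:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([30, 56, 30, 74])

clf = MultinomialNB().fit(X, y)
proba = clf.predict_proba(X)

print(clf.classes_)       # [30 56 74] -- column order of proba
print(proba.shape)        # (4, 3): one row per sample, one column per class
print(proba.sum(axis=1))  # each row sums to 1
```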

Note: The data that I used can be found here


EDIT

If you train the model on documents that have, say, 4 tags (predictors), then any new document you want to predict on must also have the same number of tags.

Example 2:

clf.fit(X, y)

Here, X is a [29, 2] array: 29 training samples (documents), each with 2 tags (predictors).

clf.predict(X_new)

Here, X_new should be [n, 2]: we can predict the classes of n new documents, but each new document must also have exactly 2 tags (predictors).
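A small sketch of this shape contract, using made-up arrays in place of the original data:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[1, 0], [0, 1], [0, 0], [1, 1]])  # 4 samples, 2 predictors
y = np.array([30, 74, 30, 56])

clf = MultinomialNB().fit(X, y)

X_new = np.array([[1, 0], [0, 0]])  # n=2 new documents, still 2 predictors
print(clf.predict(X_new))           # one predicted class per new document

try:
    clf.predict(np.array([[1, 0, 1]]))  # 3 predictors: shape mismatch
except ValueError as e:
    print(type(e).__name__)  # ValueError
```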

Upvotes: 1
