Reputation: 1044
I need to create a model that classifies records based on a categorical variable. For instance, if a record has predictor A or B, I want it to be classified as having predicted value X. The actual data is in this form:
Predicted Predictor
X A
X B
Y D
X A
For my solution, I did the following:
1. Used LabelEncoder to create numerical values for the Predicted column.
2. The predictor variable has multiple categories, which I parsed into individual columns using get_dummies.
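These two preprocessing steps can be sketched on a toy frame mirroring the data shown above (the column names follow the question; the frame itself is made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data in the same shape as the question's example
df = pd.DataFrame({"Predicted": ["X", "X", "Y", "X"],
                   "Predictor": ["A", "B", "D", "A"]})

# step 1: LabelEncoder turns the Predicted strings into integer codes
le = LabelEncoder()
df["Predicted"] = le.fit_transform(df["Predicted"])

# step 2: get_dummies expands Predictor into one indicator column per category
df = pd.get_dummies(df, columns=["Predictor"])

print(df.columns.tolist())
# ['Predicted', 'Predictor_A', 'Predictor_B', 'Predictor_D']
```

`le.inverse_transform` can later map the integer predictions back to the original X/Y labels.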
Here is a sub-section of the dataframe with the encoded Predicted column and a couple of the dummy predictor columns (pardon the misalignment):
Predicted Predictor_A Predictor_B
9056 30 0 0
2482 74 1 0
3407 56 1 0
12882 15 0 0
7988 30 0 0
13032 12 0 0
9738 28 0 0
6739 40 0 0
373 131 0 0
3030 62 0 0
8964 30 0 0
691 125 0 0
6214 41 0 0
6438 41 1 0
5060 42 0 0
3703 49 0 0
12461 16 0 0
2235 75 0 0
5107 42 0 0
4464 46 0 0
7075 39 1 0
11891 16 0 0
9190 30 0 0
8312 30 0 0
10328 24 0 0
1602 97 0 0
8804 30 0 0
8286 30 0 0
6821 40 0 0
3953 46 1
After reshaping the data into the dataframe shown above, I try using MultinomialNB from sklearn. When doing so, the error I run into is:
ValueError: Found input variables with inconsistent numbers of samples: [1, 8158]
I run into this error even when trying it with a dataframe that has only 2 columns: Predicted and Predictor_A.
My question is: what is causing this error, and how do I fix it?
Upvotes: 1
Views: 177
Reputation: 33147
To fit the MultinomialNB model, you need the training samples with their features and their corresponding labels (target values). In your case, Predicted is the target variable, and Predictor_A and Predictor_B are the features (predictors).
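The ValueError you hit means that the X and y you passed to fit disagree on the number of samples (their first dimensions differ, here 1 vs. 8158). A minimal sketch reproducing and fixing it, with made-up numbers:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

y = np.array([30, 74, 56, 15, 30])   # 5 labels, one per sample
X = np.array([[0, 1, 0, 0, 1]])      # shape (1, 5): ONE sample with 5 features

clf = MultinomialNB()
try:
    clf.fit(X, y)
except ValueError as e:
    msg = str(e)
print(msg)
# Found input variables with inconsistent numbers of samples: [1, 5]

# the fix: one ROW per sample, one COLUMN per feature
X = np.array([0, 1, 0, 0, 1]).reshape(-1, 1)  # shape (5, 1)
clf.fit(X, y)                                  # now X and y both have 5 samples
```

Selecting the features with a double-bracket `df[['Predictor_A']]` (as in the example below) keeps X two-dimensional and avoids this shape problem.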
Example 1:
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("dt.csv", sep=r"\s+")  # whitespace-delimited file
# X holds the feature columns
X = df[['Predictor_A', 'Predictor_B']]
# y holds the labels (targets / classes)
y = df['Predicted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = MultinomialNB()
clf.fit(X_train, y_train)
clf.predict(X_test)
# every test sample is predicted as class 30
# this result makes sense if you look at X_test: the samples are all similar
print(X_test)
Predictor_A Predictor_B
8286 0 0
12461 0 0
6214 0 0
9190 0 0
373 0 0
3030 0 0
11891 0 0
9056 0 0
8804 0 0
6438 1 0
# get the per-class probabilities
clf.predict_proba(X_test)
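To make the shape of that predict_proba output concrete, here is a tiny stand-in with two dummy columns and made-up labels (not the question's actual data):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# two dummy feature columns, made-up class labels
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([30, 74, 30, 30])

clf = MultinomialNB().fit(X, y)
proba = clf.predict_proba(X)

print(proba.shape)        # (4, 2): one row per sample, one column per class
print(clf.classes_)       # [30 74]: the column order of predict_proba
print(proba.sum(axis=1))  # each row sums to 1.0
```

The columns of predict_proba line up with clf.classes_, so the predicted class is simply the column with the highest probability in each row.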
Note: The data that I used can be found here
If you train the model on documents that have, say, 4 tags (predictors), then any new document you want to predict on must also have the same number of tags.
Example 2:
clf.fit(X, y)
Here, X is a [29, 2] array: 29 training samples (documents), each with 2 tags (predictors).
clf.predict(X_new)
Here, X_new can be [n, 2]: we can predict the classes of n new documents, but each of these new documents must also have exactly 2 tags (predictors).
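A short sketch of Example 2's point, with made-up numbers (4 training documents instead of 29): predicting on new documents with the matching number of tags works, while a mismatched document raises an error.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# train on documents with exactly 2 tags (features)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([30, 74, 56, 30])
clf = MultinomialNB().fit(X, y)

# new documents must also have exactly 2 tags
X_new = np.array([[1, 0], [0, 0]])
pred = clf.predict(X_new)   # one predicted class per new document

# a document with 3 tags does not match the trained model
try:
    clf.predict(np.array([[1, 0, 1]]))
    mismatch = False
except ValueError:
    mismatch = True        # sklearn rejects the feature-count mismatch
```

The same rule applies to get_dummies output: the dummy columns of the new data must match the dummy columns the model was trained on, in both count and order.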
Upvotes: 1