BATspock

Reputation: 152

Bad Input Shape Sklearn Error After HashingVectorizer

I have 204567 words of which 21010 are unique. Each word is associated with a unique tag. In total, there are 46 unique tags.

I used feature hashing to map the 204567 words with HashingVectorizer(). I then one-hot encoded the tags and used a Perceptron() model for this multi-class classification problem.

from keras.utils import np_utils
from sklearn.feature_extraction.text import HashingVectorizer 
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import LabelEncoder

vect = HashingVectorizer(decode_error='ignore', n_features=2**15,
                          preprocessor=None)
X = vect.transform(X_train)

encoder = LabelEncoder()
y = encoder.transform(y_train)
target = np_utils.to_categorical(y)

ppn = Perceptron(n_iter=40, eta0=0.1, random_state=0)
ppn.fit(X, target)

However, I receive the following error: ValueError: bad input shape (204567, 46)

Is there a better way to encode the tags?

P.S. Please explain the error and suggest a possible solution.

Upvotes: 1

Views: 353

Answers (2)

BATspock

Reputation: 152

I changed my code as follows and now it's working:

from numpy import array
from keras.utils import np_utils
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

# Hash each word into a fixed-size sparse feature vector
vec = HashingVectorizer(decode_error='ignore', n_features=2**15)
X = vec.fit_transform(X_train)

values = array(y_train)

# Map the string tags to integer class labels
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

# One-hot encode the integer labels (one column per tag)
encoded = np_utils.to_categorical(integer_encoded)
print(X.shape)
print(encoded.shape)

clf = MLPClassifier(activation='logistic', solver='adam',
                    batch_size=100, learning_rate='adaptive',
                    max_iter=20, random_state=1, verbose=True)
clf.fit(X, encoded)
print('Accuracy: %.3f' % clf.score(X, encoded))

I changed my model from Perceptron to the multi-layer perceptron classifier (MLPClassifier), though I am not completely sure why this works. Explanations are welcome. Next, I have to approach the same problem using an n-gram model and compare the results.
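One likely explanation, sketched on toy data: Perceptron.fit() expects y as a 1-D vector of class labels, so a one-hot matrix of shape (n_samples, n_classes) is rejected with a ValueError, while MLPClassifier accepts a binary indicator matrix and treats it as a multilabel target. The words and tags below are made-up stand-ins for X_train and y_train, n_features is reduced to keep the example light, and max_iter is used on the assumption of a recent scikit-learn release where n_iter is no longer accepted.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for X_train / y_train from the post.
words = ["the", "dog", "runs", "my", "cat", "sleeps", "the", "cat", "runs"]
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "VERB", "DET", "NOUN", "VERB"]

# Smaller n_features than the post's 2**15, only to keep the sketch fast.
vect = HashingVectorizer(decode_error="ignore", n_features=2**10)
X = vect.transform(words)

# LabelEncoder must be fitted before transforming; fit_transform does both.
encoder = LabelEncoder()
y = encoder.fit_transform(tags)              # shape (9,), integer labels

# One-hot version of the same labels, shape (9, 3).
one_hot = np.eye(len(encoder.classes_))[y]

ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)

# A 2-D target is rejected by Perceptron: this is the "bad input shape" case.
try:
    ppn.fit(X, one_hot)
    one_hot_error = None
except ValueError as exc:
    one_hot_error = str(exc)

# The 1-D integer labels are accepted; multi-class is handled one-vs-rest.
ppn.fit(X, y)
predicted_tags = encoder.inverse_transform(ppn.predict(X))

# MLPClassifier accepts the indicator matrix and treats it as multilabel.
mlp = MLPClassifier(activation="logistic", solver="adam",
                    max_iter=20, random_state=1)
mlp.fit(X, one_hot)
mlp_pred = mlp.predict(X)                    # indicator matrix, shape (9, 3)
```

Under this reading, the one-hot step is not needed for Perceptron at all: the integer labels from LabelEncoder can be passed to fit() directly. With MLPClassifier and a one-hot target, score() reports subset accuracy, since each column is predicted as an independent binary output.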

Upvotes: 2

developer_hatch

Reputation: 16224

The function np_utils.to_categorical() expects a class vector as its parameter, but you gave it a shape.

See docs:

to_categorical

to_categorical(y, num_classes=None)

Converts a class vector (integers) to binary class matrix.

E.g. for use with categorical_crossentropy.

Arguments

y: class vector to be converted into a matrix (integers from 0 to num_classes).
num_classes: total number of classes.

Returns

A binary matrix representation of the input.

So

target = np_utils.to_categorical(y)

gives you that kind of error.
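To make the quoted docs concrete, here is a minimal NumPy sketch of the conversion they describe. The helper to_one_hot is a hypothetical stand-in written for illustration, not the Keras implementation; it only mirrors the documented behaviour of turning a vector of integer class labels into a binary matrix with one column per class.

```python
import numpy as np


def to_one_hot(y, num_classes=None):
    """Hypothetical stand-in for the documented to_categorical behaviour."""
    y = np.asarray(y, dtype=int).ravel()      # class vector, shape (n,)
    if num_classes is None:
        num_classes = int(y.max()) + 1        # infer from the largest label
    matrix = np.zeros((y.shape[0], num_classes), dtype="float32")
    matrix[np.arange(y.shape[0]), y] = 1.0    # set one entry per row
    return matrix


labels = np.array([0, 2, 1, 2])               # made-up integer labels
one_hot = to_one_hot(labels)                  # shape (4, 3)

# argmax along the class axis recovers the original 1-D label vector.
recovered = one_hot.argmax(axis=1)
```

Applied to the question's data, the same conversion takes a label vector of shape (204567,) to a matrix of shape (204567, 46), which matches the shape reported in the error message. Taking argmax over the columns goes back to the 1-D labels.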

Upvotes: 0
