asmgx

Reputation: 7984

Error with TfidfVectorizer but ok with CountVectorizer

I have been working on this all day with no luck.

I managed to narrow the problem down to one line, the one that creates the TfidfVectorizer.

Here is my working code, which uses CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(xtrain) 

X_train_count = vectorizer.transform(xtrain)
X_test_count  = vectorizer.transform(xval)
X_train_count


from keras.models import Sequential
from keras import layers

input_dim = X_train_count.shape[1]  # Number of features

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))


model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.summary()

history = model.fit(X_train_count, ytrain,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test_count, yval),
                    batch_size=10)

But when I switch to TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

#TF-IDF initializer
vectorizer = TfidfVectorizer(max_df=0.8, max_features=1000)

vectorizer.fit(xtrain) 

X_train_count = vectorizer.transform(xtrain)
X_test_count  = vectorizer.transform(xval)
X_train_count


from keras.models import Sequential
from keras import layers

input_dim = X_train_count.shape[1]  # Number of features

model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))


model.compile(loss='binary_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.summary()

history = model.fit(X_train_count, ytrain,
                    epochs=10,
                    verbose=False,
                    validation_data=(X_test_count, yval),
                    batch_size=10)

The only things that changed are these two lines:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8, max_features=1000)

With that change I get this error:

InvalidArgumentError: indices[1] = [0,997] is out of order. Many sparse ops require sorted indices.
Use tf.sparse.reorder to create a correctly ordered copy.

[Op:SerializeManySparse]

How can I fix this, and why is it happening?

Upvotes: 3

Views: 1603

Answers (1)

Marco Cerliani

Reputation: 22031

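vectorizer.transform(...) returns a SciPy sparse matrix, not a NumPy array, and Keras does not handle that well here: the sparse op named in the error requires sorted indices, which the TF-IDF output evidently does not have. The simplest fix is to convert the result to a dense array: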

vectorizer.transform(...).toarray()
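
For illustration, here is a minimal sketch of that conversion applied to both splits. The toy corpus is made up and only stands in for the question's xtrain and xval. The second option is an assumption on my part rather than something tested against the asker's setup: since the error complains about unsorted indices, sorting the sparse matrix in place with SciPy's sort_indices() may also be enough if you would rather keep the data sparse.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for xtrain / xval from the question.
xtrain = ["the cat sat on the mat", "the dog ate my homework", "cats and dogs"]
xval = ["the dog sat on the cat"]

vectorizer = TfidfVectorizer(max_df=0.8, max_features=1000)
vectorizer.fit(xtrain)

# Option 1: convert the sparse output to dense NumPy arrays.
# These can be passed to model.fit(...) and validation_data as usual.
X_train_dense = vectorizer.transform(xtrain).toarray()
X_val_dense = vectorizer.transform(xval).toarray()

# Option 2 (assumption, see above): keep the matrix sparse,
# but sort its column indices in place before handing it on.
X_train_sparse = vectorizer.transform(xtrain)
X_train_sparse.sort_indices()

print(type(X_train_dense), X_train_dense.shape)
print(X_train_sparse.has_sorted_indices)
```

Keep in mind that toarray() materializes every zero, so memory grows as n_samples × n_features. With max_features=1000 that is usually fine, but it can get heavy for a large vocabulary.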

Upvotes: 8
