After training my model for the toxic comment challenge with Keras, the predictions it gives are bad. I'm not sure if I'm doing something wrong, because the accuracy during training was pretty good (~0.98).
How I trained
import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
train = pd.read_csv('train.csv')
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_train = train["comment_text"]
max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
inp = Input(shape=(maxlen, ))
embed_size = 128
x = Embedding(max_features, embed_size)(inp)
x = LSTM(60, return_sequences=True, name='lstm_layer')(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
batch_size = 32
epochs = 2
print(X_t[0])
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1)
model.save("m.hdf5")
This is how I predict
import numpy as np
from keras.models import load_model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

model = load_model('m.hdf5')
list_sentences_train = np.array(["I love you Stackoverflow"])
max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
print(X_t)
print(model.predict(X_t))
Output
[[ 1.97086316e-02 9.36032447e-05 3.93966911e-03 5.16672269e-04 3.67353857e-03 1.28102733e-03]]
In the inference (i.e. prediction) phase, you should use the same pre-processing steps you used during training of the model. Therefore, you should not create a new Tokenizer instance and fit it on your test data. Rather, if you want to be able to run prediction later with the same model, besides the model you must also save everything derived from the training data, such as the vocabulary stored in the Tokenizer instance. So it would be like this:
import pickle
# building and training of the model as you have done ...
# store all the data we need later: model and tokenizer
model.save("m.hdf5")
with open('tokenizer.pkl', 'wb') as handler:
    pickle.dump(tokenizer, handler)
And now in the prediction phase:
import pickle
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

model = load_model('m.hdf5')
with open('tokenizer.pkl', 'rb') as handler:
    tokenizer = pickle.load(handler)
list_sentences_train = ["I love you Stackoverflow"]
# use the same tokenizer instance you used in the training phase
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
print(model.predict(X_t))
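To see why fitting a fresh Tokenizer on the test sentence produces the near-zero scores shown in the question, here is a minimal sketch (the training sentences below are made up for illustration): a Tokenizer fitted only on the test text assigns word indices that have no relation to the indices the model saw during training.
from keras.preprocessing.text import Tokenizer

# tokenizer fitted on a (toy) training corpus, as in the training script
train_tokenizer = Tokenizer(num_words=20000)
train_tokenizer.fit_on_texts(["you are toxic", "I love you"])

# tokenizer fitted only on the single test sentence, as in the question
test_tokenizer = Tokenizer(num_words=20000)
test_tokenizer.fit_on_texts(["I love you Stackoverflow"])

sentence = ["I love you Stackoverflow"]
print(train_tokenizer.texts_to_sequences(sentence))  # indices from the training vocabulary
print(test_tokenizer.texts_to_sequences(sentence))   # different indices, meaningless to the model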
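If pickling the Tokenizer is not desirable, it can also be serialized to JSON; this is a minimal sketch assuming your Keras release ships keras.preprocessing.text.tokenizer_from_json (added in later versions).
from keras.preprocessing.text import tokenizer_from_json

# in the training script: save the fitted tokenizer as JSON
with open('tokenizer.json', 'w') as f:
    f.write(tokenizer.to_json())

# in the prediction script: restore the same tokenizer
with open('tokenizer.json') as f:
    tokenizer = tokenizer_from_json(f.read())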