BilalReffas

Reputation: 8328

Bad accuracy when making predictions

After training my model for the toxic comment challenge with Keras, the accuracy of its predictions is bad. I'm not sure if I'm doing something wrong, but the accuracy during training was pretty good (~0.98).

How I trained

import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

train = pd.read_csv('train.csv')


list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_train = train["comment_text"]

max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)

maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

inp = Input(shape=(maxlen, ))

embed_size = 128
x = Embedding(max_features, embed_size)(inp)
x = LSTM(60, return_sequences=True, name='lstm_layer')(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)

model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

batch_size = 32
epochs = 2
print(X_t[0])
model.fit(X_t,y, batch_size=batch_size, epochs=epochs, validation_split=0.1)

model.save("m.hdf5")

This is how I predict

import numpy as np
from keras.models import load_model
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

model = load_model('m.hdf5')

list_sentences_train = np.array(["I love you Stackoverflow"])

max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)

maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

print(X_t)

print(model.predict(X_t))

Output

[[ 1.97086316e-02 9.36032447e-05 3.93966911e-03 5.16672269e-04 3.67353857e-03 1.28102733e-03]]

Upvotes: 1

Views: 210

Answers (1)

today

Reputation: 33410

In the inference (i.e. prediction) phase, you should apply the same pre-processing steps you used during training of the model. Therefore, you should not create a new Tokenizer instance and fit it on your test data. Rather, if you want to be able to make predictions later with the same model, besides the model itself you must also save all the statistics obtained from the training data, such as the vocabulary stored in the Tokenizer instance. So it would look like this:

import pickle

# building and training of the model as you have done ...

# store all the data we need later: model and tokenizer    
model.save("m.hdf5")
with open('tokenizer.pkl', 'wb') as handler:
    pickle.dump(tokenizer, handler)
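
As a side note, depending on your Keras version (roughly 2.2.4 and newer, where Tokenizer.to_json and tokenizer_from_json are available), you could serialize the tokenizer as JSON instead of pickling it. This is only a sketch under that version assumption:

from keras.preprocessing.text import tokenizer_from_json

# after training: write the tokenizer's state (vocabulary, counts, config) as JSON
with open('tokenizer.json', 'w') as handler:
    handler.write(tokenizer.to_json())

# at prediction time: rebuild the exact same tokenizer from the JSON file
with open('tokenizer.json') as handler:
    tokenizer = tokenizer_from_json(handler.read())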

And now in prediction phase:

import pickle
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

model = load_model('m.hdf5')
with open('tokenizer.pkl', 'rb') as handler:
    tokenizer = pickle.load(handler)

list_sentences_train = ["I love you Stackoverflow"]

# use the same tokenizer instance you used in the training phase
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)

print(model.predict(X_t))
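
To make the raw probabilities easier to read, you can map them back onto the class names from the training script and threshold them, e.g. at 0.5 (just an illustrative post-processing step, not part of the original code):

# same class order used when building y in the training script
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

probs = model.predict(X_t)[0]
predicted = {name: bool(p > 0.5) for name, p in zip(list_classes, probs)}
print(predicted)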

Upvotes: 1
