Reputation: 2859
I'm creating an NLP model in which I tokenize the text:
from tensorflow.keras.preprocessing.text import Tokenizer

num_words = 5000
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(data)
Then I convert the texts to sequences, calculate max_tokens to determine the input dimension, and pad them:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train_tokens = tokenizer.texts_to_sequences(X_train)
X_test_tokens = tokenizer.texts_to_sequences(X_test)

# Pad to the mean sequence length plus two standard deviations
num_tokens = np.array([len(tokens) for tokens in X_train_tokens + X_test_tokens])
max_tokens = int(np.mean(num_tokens) + 2 * np.std(num_tokens))

X_train_pad = pad_sequences(X_train_tokens, maxlen=max_tokens)
X_test_pad = pad_sequences(X_test_tokens, maxlen=max_tokens)
Then I build a Keras model and save it.
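For context, that step looks roughly like this (the architecture, layer sizes, loss, and the y_train labels are illustrative assumptions, not taken from my actual model):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

model = Sequential([
    # input_dim must match the tokenizer's vocabulary size (num_words),
    # input_length must match the padding length (max_tokens)
    Embedding(input_dim=num_words, output_dim=64, input_length=max_tokens),
    GlobalAveragePooling1D(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train_pad, y_train, epochs=5, validation_split=0.1)  # y_train: assumed binary labels
model.save("model.h5")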
Later, in a separate script, I load the trained model. However, this time there is no tokenizer to prepare my text input, and I don't know what the input_dim is, since training and testing happen in two different Python files.
How can I reuse the information from training at test time? What is the correct methodology for testing an NLP model? How can I use the same fitted tokenizer and the calculated num_words in a separate Python file?
Upvotes: 0
Views: 93
Reputation: 2694
You have to save your fitted Tokenizer as well. This is easily done with pickle:
import pickle

with open("tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)
And then, when you load your Keras model, load the tokenizer as well:
with open("tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)
Ideally you would wrap the tokenizer and the model into a pipeline, so you can save and load them together and won't forget to save something essential.
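For example, a minimal sketch of the loading side at test time, assuming the model was saved as "model.h5" and that the max_tokens value from training is persisted or hard-coded alongside the tokenizer (the file names and the example value are assumptions):
import pickle
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the trained model and the fitted tokenizer saved during training
model = load_model("model.h5")
with open("tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)

# max_tokens must be the same value computed during training;
# here it is hard-coded, but you could pickle it next to the tokenizer
max_tokens = 100

# Preprocess new text exactly as in training, then predict
new_texts = ["some unseen example text"]
sequences = tokenizer.texts_to_sequences(new_texts)
padded = pad_sequences(sequences, maxlen=max_tokens)
predictions = model.predict(padded)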
Upvotes: 1