Reputation: 2859
I'm creating an NLP model in which I tokenize the text:
from tensorflow.keras.preprocessing.text import Tokenizer

num_words = 5000
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(data)
Then I convert the texts to sequences, calculate max_tokens to determine the input dimension, and pad them:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train_tokens = tokenizer.texts_to_sequences(X_train)
X_test_tokens = tokenizer.texts_to_sequences(X_test)

# Pad to the mean sequence length plus two standard deviations
num_tokens = np.array([len(tokens) for tokens in X_train_tokens + X_test_tokens])
max_tokens = int(np.mean(num_tokens) + 2 * np.std(num_tokens))

X_train_pad = pad_sequences(X_train_tokens, maxlen=max_tokens)
X_test_pad = pad_sequences(X_test_tokens, maxlen=max_tokens)
Then I build a Keras model and save it.
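For context, that step looks roughly like this (the architecture, layer sizes, loss, and the y_train labels are illustrative assumptions, not taken from my actual model):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

model = Sequential([
    # input_dim must match the tokenizer's vocabulary size (num_words),
    # input_length must match the padding length (max_tokens)
    Embedding(input_dim=num_words, output_dim=64, input_length=max_tokens),
    GlobalAveragePooling1D(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train_pad, y_train, epochs=5, validation_split=0.1)  # y_train: assumed binary labels
model.save("model.h5")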
Later, in a separate script, I load the trained model. However, this time there is no tokenizer to prepare my text input, and I don't know what the input_dim is, since training and testing happen in two different Python files.
How can I reuse the information from training at test time? What is the correct methodology for testing an NLP model? How can I use the same fitted tokenizer and the calculated num_words in a separate Python file?
Upvotes: 0
Views: 93
Reputation: 2694
You have to save your fitted Tokenizer as well. This is easily done with pickle:
import pickle

with open("tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)
And then, when you load your Keras model, load the tokenizer as well:
with open("tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)
Ideally you would wrap the tokenizer and the model into a pipeline, so you can save and load them together and won't forget to save something essential.
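For example, a minimal sketch of the loading side at test time, assuming the model was saved as "model.h5" and that the max_tokens value from training is persisted or hard-coded alongside the tokenizer (the file names and the example value are assumptions):
import pickle
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the trained model and the fitted tokenizer saved during training
model = load_model("model.h5")
with open("tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)

# max_tokens must be the same value computed during training;
# here it is hard-coded, but you could pickle it next to the tokenizer
max_tokens = 100

# Preprocess new text exactly as in training, then predict
new_texts = ["some unseen example text"]
sequences = tokenizer.texts_to_sequences(new_texts)
padded = pad_sequences(sequences, maxlen=max_tokens)
predictions = model.predict(padded)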
Upvotes: 1