Jonathan Bechtel

Reputation: 3617

How to create an NLP processing pipeline with Keras

I regularly use scikit-learn pipelines to streamline model processing, and I'm wondering about the easiest way to do something similar with Keras in Tensorflow 2.0.

What I'd like to do is deploy a Keras model as an API endpoint, and then submit a piece of text in a numpy array to it and have it tokenized, padded and predicted. But I don't know the shortest path to do this.

Here's some sample code:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Flatten
import numpy as np

sample_words = [
'The sky is blue',
'The sky delivers us many gifts',
'Wise men appreciate gifts for what they are, not what they are not',
'Wherever you go, there you are',
'Don\'t pass judgment onto others, or you will quickly be judged yourself'
]

y = np.array([1, 0, 1, 1, 0])

tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(sample_words)

train_sequences = tokenizer.texts_to_sequences(sample_words)

train_sequences = pad_sequences(train_sequences, maxlen=7)
mod = Sequential([
    Embedding(10, 2, input_length=7),
    Flatten(),
    Dense(3, activation='relu'),
    Dense(1, activation='sigmoid')
])

mod.compile(optimizer='adam', loss='binary_crossentropy')
mod.fit(train_sequences, y)

The idea is that if someone submits a web form with the words 'The sky is pretty today', I can wrap the text in a numpy array, send it to the endpoint (which will be set up on Google Cloud), and have it tokenized, padded, and predicted.
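Without a pipeline object, the inference path described above has to be done by hand: the fitted Tokenizer and the padding length must travel with the model, and every incoming string gets the same preprocessing before predict(). A minimal sketch of that two-step flow, reusing the question's settings (the predict_text helper is illustrative, not a Keras API):

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Dense, Flatten

sample_words = [
    'The sky is blue',
    'The sky delivers us many gifts',
]
y = np.array([1, 0])

# Fit the vocabulary on the training text, exactly as in the question.
tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(sample_words)

mod = Sequential([
    Input(shape=(7,), dtype='int32'),
    Embedding(10, 2),
    Flatten(),
    Dense(1, activation='sigmoid'),
])
mod.compile(optimizer='adam', loss='binary_crossentropy')
mod.fit(pad_sequences(tokenizer.texts_to_sequences(sample_words), maxlen=7),
        y, verbose=0)

def predict_text(text):
    """Apply the exact training-time preprocessing to one raw string."""
    seq = tokenizer.texts_to_sequences([text])  # tokenize with the fitted vocab
    padded = pad_sequences(seq, maxlen=7)       # pad to the training length
    return mod.predict(padded, verbose=0)       # shape (1, 1) probability

print(predict_text('The sky is pretty today'))
```

The drawback is that the tokenizer is a separate artifact from the model, so the endpoint has to pickle and reload it alongside the saved model.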

In scikit-learn it would be as simple as pipe = make_pipeline(tokenizer, mod), and then go from there.
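Strictly speaking, even that scikit-learn one-liner needs a small adapter, since Keras' Tokenizer does not implement sklearn's transformer interface. A hedged sketch of such an adapter (KerasTextTransformer is a made-up name, not a library class):

```python
from sklearn.base import BaseEstimator, TransformerMixin
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

class KerasTextTransformer(BaseEstimator, TransformerMixin):
    """Adapts Keras' Tokenizer + pad_sequences to sklearn's fit/transform API."""

    def __init__(self, num_words=10, maxlen=7):
        self.num_words = num_words
        self.maxlen = maxlen

    def fit(self, X, y=None):
        self.tokenizer_ = Tokenizer(num_words=self.num_words)
        self.tokenizer_.fit_on_texts(X)
        return self

    def transform(self, X):
        seqs = self.tokenizer_.texts_to_sequences(X)
        return pad_sequences(seqs, maxlen=self.maxlen)

vec = KerasTextTransformer().fit(['The sky is blue',
                                  'The sky delivers us many gifts'])
print(vec.transform(['The sky is pretty today']).shape)  # (1, 7)
```

From there make_pipeline(KerasTextTransformer(), mod) should work in principle, since a Keras model exposes fit/predict, though sklearn will not save or reload the Keras model for you.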

I have a feeling there are some solutions that involve tf.data.Dataset, but I was hoping Keras had something more user-friendly built in.

Upvotes: 2

Views: 718

Answers (1)

mb0850

Reputation: 593

Keras makes this easy in the sense that there is no need to explicitly build a pipeline.

A Keras model uses the TensorFlow backend to build a computation graph, which is loosely analogous to a scikit-learn pipeline.

Thus your mod is itself equivalent to a pipeline with the operations Embedding -> Flatten -> Dense -> Dense, and the mod.compile() call generates the TensorFlow computation graph.

Everything then comes together in the mod.fit() method, where you feed your inputs into the model (i.e. the pipeline) and it trains on your data.

To make the tokenization part of the model itself, you can use the TextVectorization layer.

This layer has basic options for managing text in a Keras model. It transforms a batch of strings (one sample = one string) into either a list of token indices (one sample = 1D tensor of integer token indices) or a dense representation (one sample = 1D tensor of float values representing data about the sample's tokens).

Code snapshot:

import tensorflow as tf

max_features = 5000  # maximum vocabulary size
max_len = 4          # pad/truncate output sequences to this length

text_dataset = tf.data.Dataset.from_tensor_slices(["foo", "bar", "baz"])

vectorize_layer = TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len
)

# Learn the vocabulary from the text-only dataset.
vectorize_layer.adapt(text_dataset.batch(64))

# The model maps raw strings to padded token indices.
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)

input_data = [["foo qux bar"], ["qux baz"]]
model.predict(input_data)
>>>
array([[2, 1, 4, 0],
   [1, 3, 0, 0]])
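Applied to the question's data, this gives a model whose first layer is the vectorizer, so a deployed endpoint can accept raw strings with no separate tokenize-and-pad step. A minimal end-to-end sketch (layer sizes copied from the question; the exact import path for TextVectorization varies across TF 2.x versions):

```python
import numpy as np
import tensorflow as tf

sample_words = [
    'The sky is blue',
    'The sky delivers us many gifts',
    'Wise men appreciate gifts for what they are, not what they are not',
    'Wherever you go, there you are',
    "Don't pass judgment onto others, or you will quickly be judged yourself",
]
y = np.array([1, 0, 1, 1, 0])

# Learn the vocabulary directly from the raw training strings.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=10, output_mode='int', output_sequence_length=7)
vectorize_layer.adapt(sample_words)

# Vectorization is the first step of the model, so the saved model
# (and hence the endpoint) accepts raw strings directly.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    tf.keras.layers.Embedding(10, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(np.array([[s] for s in sample_words]), y, verbose=0)

# Raw text goes straight into predict(); no external tokenizer needed.
pred = model.predict(np.array([['The sky is pretty today']]), verbose=0)
print(pred)
```

Because the preprocessing lives inside the graph, model.save() captures everything the endpoint needs in one artifact.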

Upvotes: 1
