Jane Sully

Reputation: 3337

How to specify input sequence length for BERT tokenizer in Tensorflow?

I am following this example to use BERT for sentiment classification.

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3") # 128 by default
encoder_inputs = preprocessor(text_input)
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=True)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 768].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 768].
embedding_model = tf.keras.Model(text_input, pooled_output)
sentences = tf.constant(["(your text here)"])
print(embedding_model(sentences))

The default sequence length seems to be 128, judging from the output shapes of encoder_inputs. However, I'm not sure how to change this. Ideally I'd like to use a larger sequence length.
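For reference, this is roughly how I checked the shapes (the dict keys are whatever the preprocessing model returns):

for name, tensor in encoder_inputs.items():
    print(name, tensor.shape)  # input_word_ids, input_mask, input_type_ids - each (None, 128)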

There's an example of modifying the sequence length on the preprocessor page, but I'm not sure how to incorporate it into the functional model definition I have above. I would greatly appreciate any help with this.

Upvotes: 0

Views: 1852

Answers (1)

ML_Engine

Reputation: 1185

Just going off the docs here (I haven't tested this), you might do:

preprocessor = hub.load(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")


text_inputs = [tf.keras.layers.Input(shape=(), dtype=tf.string)]

It doesn't look like you've tokenized your data above - see below:

tokenize = hub.KerasLayer(preprocessor.tokenize)
tokenized_inputs = [tokenize(segment) for segment in text_inputs]
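For context (again just going off the docs), tokenize maps each string to a RaggedTensor of wordpiece ids, which is why packing to a fixed seq_length is a separate step:

tokens = preprocessor.tokenize(tf.constant(["hello tensorflow"]))
print(tokens)  # RaggedTensor of shape [batch, words, wordpieces] holding int token ids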

Next select your sequence length:

seq_length = 128  # Your choice here.

Here is where you pass in the seq_length:

bert_pack_inputs = hub.KerasLayer(
    preprocessor.bert_pack_inputs,
    arguments=dict(seq_length=seq_length))  # Optional argument.
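(Note that seq_length can be at most 512 for this encoder - that's the maximum number of positions BERT's position embeddings were trained with.)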

Now pack your tokenized inputs by running bert_pack_inputs (this replaces the preprocessor(text_input) call above):

encoder_inputs = bert_pack_inputs(tokenized_inputs)
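This should give you the same dict of input_word_ids, input_mask and input_type_ids that the all-in-one preprocessor produced, just padded/truncated to your chosen seq_length.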

And then the rest of your code:

encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=True)

outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 768].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 768].
embedding_model = tf.keras.Model(text_inputs, pooled_output)  # note: text_inputs (the list defined above)
sentences = tf.constant(["(your text here)"])
print(embedding_model(sentences))
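As a quick sanity check (untested), the sequence axis of sequence_output should now reflect whatever seq_length you chose:

print(sequence_output.shape)  # (None, seq_length, 768), e.g. (None, 128, 768) here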

Upvotes: 2
