Reputation: 890
I am working on a text classification problem, for which I am trying to train my model on TFBertForSequenceClassification given in the huggingface-transformers library.
I followed the example given on their github page, and I am able to run the sample code with the given sample data using tensorflow_datasets.load('glue/mrpc').
However, I am unable to find an example of how to load my own custom data and pass it to
model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=valid_dataset, validation_steps=7).
How can I define my own X, tokenize it, and prepare train_dataset with my X and Y, where X represents my input text and Y represents the classification category of a given X?
Sample training dataframe:
text category_index
0 Assorted Print Joggers - Pack of 2 ,/ Gray Pri... 0
1 "Buckle" ( Matt ) for 35 mm Width Belt 0
2 (Gagam 07) Barcelona Football Jersey Home 17 1... 2
3 (Pack of 3 Pair) Flocklined Reusable Rubber Ha... 1
4 (Summer special Offer)Firststep new born baby ... 0
Upvotes: 12
Views: 12377
Reputation: 22326
There are multiple approaches to fine-tune BERT for the target tasks.
Note that the BERT base model has been pre-trained only on two tasks, as stated in the original paper.
3.1 Pre-training BERT ...we pre-train BERT using two unsupervised tasks
- Task #1: Masked LM
- Task #2: Next Sentence Prediction (NSP)
Hence, the base BERT model is half-baked and can be fully baked for the target domain (1st approach). Alternatively, we can use it as part of our custom model training with the base trainable (2nd approach) or frozen (3rd approach).
How to Fine-Tune BERT for Text Classification? demonstrated the 1st approach of further pre-training, and pointed out that the learning rate is the key to avoiding catastrophic forgetting, where the pre-trained knowledge is erased while learning new knowledge.
We find that a lower learning rate, such as 2e-5, is necessary to make BERT overcome the catastrophic forgetting problem. With an aggressive learn rate of 4e-4, the training set fails to converge.
Probably this is the reason why the BERT paper used 5e-5, 4e-5, 3e-5, and 2e-5 for fine-tuning.
We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set
Note that the base model pre-training itself used a higher learning rate.
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, β1=0.9 and β2=0.999, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.
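As a hedged sketch (not from the paper), a fine-tuning learning-rate schedule with warmup and linear decay can be built with the create_optimizer helper, assuming it is available in your transformers version; the step counts below are illustrative:
from transformers import create_optimizer

# Illustrative values: a small fine-tuning learning rate in the paper's range (5e-5 ... 2e-5)
# with a warmup of roughly 10% of the training steps, then linear decay.
num_train_steps = 1000   # e.g. steps_per_epoch * num_epochs for your dataset
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    num_warmup_steps=int(0.1 * num_train_steps),
    weight_decay_rate=0.01,
)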
The 1st approach will be described as part of the 3rd approach below.
FYI: TFDistilBertModel is the bare base model with the name distilbert.
Model: "tf_distil_bert_model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
=================================================================
Total params: 66,362,880
Trainable params: 66,362,880
Non-trainable params: 0
Huggingface takes the 2nd approach, as in Fine-tuning with native PyTorch/TensorFlow, where TFDistilBertForSequenceClassification adds the custom classification layer classifier on top of the base distilbert model, which remains trainable. The small learning rate requirement applies here as well to avoid catastrophic forgetting.
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
_________________________________________________________________
pre_classifier (Dense) multiple 590592
_________________________________________________________________
classifier (Dense) multiple 1538
_________________________________________________________________
dropout_59 (Dropout) multiple 0
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010 <--- All parameters are trainable
Non-trainable params: 0
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
DistilBertTokenizerFast,
TFDistilBertForSequenceClassification,
)
DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512
LEARNING_RATE = 5e-5
BATCH_SIZE = 16
NUM_EPOCHS = 3
# --------------------------------------------------------------------------------
# Tokenizer
# --------------------------------------------------------------------------------
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )
# --------------------------------------------------------------------------------
# Load data
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
raw_train[DATA_COLUMN].tolist(),
raw_train[LABEL_COLUMN].tolist(),
test_size=.2,
shuffle=True
)
NUM_LABELS = raw_train[LABEL_COLUMN].nunique()  # Number of target categories for the classification head
# --------------------------------------------------------------------------------
# Prepare TF dataset
# --------------------------------------------------------------------------------
train_dataset = tf.data.Dataset.from_tensor_slices((
dict(tokenize(train_data)), # Convert BatchEncoding instance to dictionary
train_label
)).shuffle(1000).batch(BATCH_SIZE).prefetch(1)
validation_dataset = tf.data.Dataset.from_tensor_slices((
dict(tokenize(validation_data)),
validation_label
)).batch(BATCH_SIZE).prefetch(1)
# --------------------------------------------------------------------------------
# training
# --------------------------------------------------------------------------------
model = TFDistilBertForSequenceClassification.from_pretrained(
'distilbert-base-uncased',
num_labels=NUM_LABELS
)
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(
optimizer=optimizer,
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(
    x=train_dataset,                      # tf.data.Dataset yielding (features, label); already batched
    y=None,                               # labels are supplied by the dataset
    validation_data=validation_dataset,
    epochs=NUM_EPOCHS,                    # do not pass batch_size when x is a tf.data.Dataset
)
Please note that the images are taken from A Visual Guide to Using BERT for the First Time and modified.
The tokenizer generates an instance of BatchEncoding, which can be used like a Python dictionary and serves as the input to the BERT model.
Holds the output of the encode_plus() and batch_encode() methods (tokens, attention_masks, etc).
This class is derived from a python dictionary and can be used as a dictionary. In addition, this class exposes utility methods to map from word/character space to token space.
Parameters
- data (dict) – Dictionary of lists/arrays/tensors returned by the encode/batch_encode methods (‘input_ids’, ‘attention_mask’, etc.).
The data attribute of the class holds the generated tokens, which have the input_ids and attention_mask elements.
The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.
This argument indicates to the model which tokens should be attended to, and which should not.
If the attention_mask is 0, the token id is ignored. For instance, if a sequence is padded to adjust the sequence length, the padded words should be ignored; hence their attention_mask is 0.
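As a quick, minimal check (the sentence is made up), the tokenizer output is a dictionary-like object whose attention_mask marks real tokens with 1 and pad tokens with 0:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
encoded = tokenizer("a short sentence", padding='max_length', max_length=8, truncation=True)
print(encoded.keys())             # dict_keys(['input_ids', 'attention_mask'])
print(encoded['attention_mask'])  # 1 for real tokens, 0 for [PAD] positions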
BertTokenizer adds special tokens, enclosing a sequence with [CLS] and [SEP]. [CLS] represents Classification and [SEP] separates sequences. For Question Answering or Paraphrase tasks, [SEP] separates the two sentences to compare.
- cls_token (str, optional, defaults to "[CLS]") – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- sep_token (str, optional, defaults to "[SEP]") – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
A Visual Guide to Using BERT for the First Time shows the tokenization.
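For example, converting the generated input_ids back to tokens makes the added special tokens visible (a minimal sketch, reusing one of the sample product titles):
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
encoded = tokenizer("Barcelona Football Jersey Home")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# The token list starts with [CLS] and ends with [SEP].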
The embedding vector for [CLS] in the output from the base model's final layer represents the classification learned by the base model. Hence, feed the embedding vector of the [CLS] token into the classification layer added on top of the base model.
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.
The model structure is illustrated below.
In the model distilbert-base-uncased, each token is embedded into a vector of size 768. The shape of the output from the base model is (batch_size, max_sequence_length, embedding_vector_size=768). This accords with the BERT paper's BERT/BASE model (as indicated in distilbert-base-uncased).
BERT/BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT/LARGE (L=24, H=1024, A=16, Total Parameters=340M).
TFDistilBertModel class to instantiate the base DistilBERT model without any specific head on top (as opposed to other classes such as TFDistilBertForSequenceClassification that do have an added classification head).
We do not want any task-specific head attached because we simply want the pre-trained weights of the base model to provide a general understanding of the English language, and it will be our job to add our own classification head during the fine-tuning process in order to help the model distinguish between toxic comments.
TFDistilBertModel generates an instance of TFBaseModelOutput whose last_hidden_state parameter is the output from the model's last layer.
TFBaseModelOutput([(
'last_hidden_state',
<tf.Tensor: shape=(batch_size, sequence_length, 768), dtype=float32, numpy=array([[[...]]], dtype=float32)>
)])
Parameters
- last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
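Before the full code for the 3rd approach, here is a minimal sketch (made-up sentence) of inspecting last_hidden_state from the bare base model:
from transformers import DistilBertTokenizerFast, TFDistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
base = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

tokens = tokenizer(["a sample sentence"], padding='max_length', max_length=32, truncation=True, return_tensors="tf")
outputs = base(input_ids=tokens['input_ids'], attention_mask=tokens['attention_mask'])
print(outputs.last_hidden_state.shape)   # (1, 32, 768) = (batch_size, sequence_length, hidden_size)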
import datetime

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertModel,
)
TIMESTAMP = datetime.datetime.now().strftime("%Y%b%d%H%M").upper()
DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512                            # Max length allowed for BERT is 512.
raw_train = pd.read_csv("./train.csv")               # Load the data here so the label count can be derived.
NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())   # Number of target categories.
MODEL_NAME = 'distilbert-base-uncased'
NUM_BASE_MODEL_OUTPUT = 768
# Flag to freeze base model
FREEZE_BASE = True
# Flag to add custom classification heads
USE_CUSTOM_HEAD = True
if not USE_CUSTOM_HEAD:
    # Make the base trainable when no classification head exists.
    FREEZE_BASE = False
BATCH_SIZE = 16
LEARNING_RATE = 1e-2 if FREEZE_BASE else 5e-5
L2 = 0.01
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )
The base model expects input_ids and attention_mask, whose shape is (max_sequence_length,). Generate Keras tensors for each of them with an Input layer.
# Inputs for token indices and attention masks
input_ids = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='attention_mask')
Generate the output from the base model. The base model generates a TFBaseModelOutput. Feed the embedding of [CLS] to the next layer.
base = TFDistilBertModel.from_pretrained(
MODEL_NAME,
num_labels=NUM_LABELS
)
# Freeze the base model weights.
if FREEZE_BASE:
    for layer in base.layers:
        layer.trainable = False
base.summary()
# [CLS] embedding is last_hidden_state[:, 0, :]
output = base([input_ids, attention_mask]).last_hidden_state[:, 0, :]
if USE_CUSTOM_HEAD:
    # --------------------------------------------------------------------------------
    # Classification layer 01
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dropout(
        rate=0.15,
        name="01_dropout",
    )(output)
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="01_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="01_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="01_relu"
    )(output)
    # --------------------------------------------------------------------------------
    # Classification layer 02
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="02_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="02_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="02_relu"
    )(output)

# Final softmax layer (always present, regardless of USE_CUSTOM_HEAD)
output = tf.keras.layers.Dense(
    units=NUM_LABELS,
    kernel_initializer='glorot_uniform',
    kernel_regularizer=tf.keras.regularizers.l2(l2=L2),
    activation='softmax',
    name="softmax"
)(output)
name = f"{TIMESTAMP}_{MODEL_NAME.upper()}"
model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output, name=name)
model.compile(
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
metrics=['accuracy']
)
model.summary()
---
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_ids (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
attention_mask (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
tf_distil_bert_model (TFDistilB TFBaseModelOutput(la 66362880 input_ids[0][0]
attention_mask[0][0]
__________________________________________________________________________________________________
tf.__operators__.getitem_1 (Sli (None, 768) 0 tf_distil_bert_model[1][0]
__________________________________________________________________________________________________
01_dropout (Dropout) (None, 768) 0 tf.__operators__.getitem_1[0][0]
__________________________________________________________________________________________________
01_dense_relu_no_regularizer (D (None, 768) 590592 01_dropout[0][0]
__________________________________________________________________________________________________
01_bn (BatchNormalization) (None, 768) 3072 01_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
01_relu (Activation) (None, 768) 0 01_bn[0][0]
__________________________________________________________________________________________________
02_dense_relu_no_regularizer (D (None, 768) 590592 01_relu[0][0]
__________________________________________________________________________________________________
02_bn (BatchNormalization) (None, 768) 3072 02_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
02_relu (Activation) (None, 768) 0 02_bn[0][0]
__________________________________________________________________________________________________
softmax (Dense) (None, 2) 1538 02_relu[0][0]
==================================================================================================
Total params: 67,551,746
Trainable params: 1,185,794
Non-trainable params: 66,365,952 <--- Base BERT model is frozen
# --------------------------------------------------------------------------------
# Split data into training and validation
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
raw_train[DATA_COLUMN].tolist(),
raw_train[LABEL_COLUMN].tolist(),
test_size=.2,
shuffle=True
)
# X = dict(tokenize(train_data))
# Y = tf.convert_to_tensor(train_label)
X = tf.data.Dataset.from_tensor_slices((
dict(tokenize(train_data)), # Convert BatchEncoding instance to dictionary
train_label
)).batch(BATCH_SIZE).prefetch(1)
V = tf.data.Dataset.from_tensor_slices((
dict(tokenize(validation_data)), # Convert BatchEncoding instance to dictionary
validation_label
)).batch(BATCH_SIZE).prefetch(1)
# --------------------------------------------------------------------------------
# Train the model
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
# Input data x can be a dict mapping input names to the corresponding array/tensors,
# if the model has named inputs. Beware of the "names". y should be consistent with x
# (you cannot have Numpy inputs and tensor targets, or inversely).
# --------------------------------------------------------------------------------
history = model.fit(
    x=X,                      # tf.data.Dataset yielding (feature dictionary, label); already batched
    # y=Y,
    y=None,                   # labels are supplied by the dataset
    epochs=NUM_EPOCHS,        # do not pass batch_size when x is a tf.data.Dataset
    validation_data=V,
)
To implement the 1st approach, change the configuration as below.
USE_CUSTOM_HEAD = False
Then FREEZE_BASE is changed to False and LEARNING_RATE is changed to 5e-5, which will run further pre-training on the base BERT model.
For the 3rd approach, saving the model causes issues. The save_pretrained method of the Huggingface model cannot be used, as the model is not a direct subclass of the Huggingface PreTrainedModel. Keras save_model causes an error with the default save_traces=True, or causes a different error with save_traces=False when loading the model with Keras load_model.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-01d66991d115> in <module>()
----> 1 tf.keras.models.load_model(MODEL_DIRECTORY)
11 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/saving/saved_model/load.py in _unable_to_call_layer_due_to_serialization_issue(layer, *unused_args, **unused_kwargs)
865 'recorded when the object is called, and used when saving. To manually '
866 'specify the input shape/dtype, decorate the call function with '
--> 867 '`@tf.function(input_signature=...)`.'.format(layer.name, type(layer)))
868
869
ValueError: Cannot call custom layer tf_distil_bert_model of type <class 'tensorflow.python.keras.saving.saved_model.load.TFDistilBertModel'>, because the call function was not serialized to the SavedModel.Please try one of the following methods to fix this issue:
(1) Implement `get_config` and `from_config` in the layer/model class, and pass the object to the `custom_objects` argument when loading the model. For more details, see: https://www.tensorflow.org/guide/keras/save_and_serialize
(2) Ensure that the subclassed model or layer overwrites `call` and not `__call__`. The input shape and dtype will be automatically recorded when the object is called, and used when saving. To manually specify the input shape/dtype, decorate the call function with `@tf.function(input_signature=...)`.
Only Keras Model save_weights worked as far as I tested.
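A minimal sketch of that workaround, assuming the same model-building code above is re-run before loading (the checkpoint path and the build_model helper are hypothetical):
# Save only the weights; SavedModel serialization of this custom model fails as shown above.
model.save_weights("./model_weights/ckpt")        # hypothetical path

# To restore, rebuild the exact same architecture first, then load the weights.
restored = build_model()                          # hypothetical helper wrapping the model definition above
restored.load_weights("./model_weights/ckpt")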
As far as I tested with the Toxic Comment Classification Challenge, the 1st approach gave better recall (identifying true toxic comments and true non-toxic comments). Code can be accessed as below. Please provide corrections/suggestions if any.
Upvotes: 12
Reputation: 22326
Expanding on the answer from konstantin_doncov.
When instantiating a model, you need to define the model initialization parameters that are defined in the Transformers configuration file. The base class is PretrainedConfig.
Base class for all configuration classes. Handles a few parameters common to all models’ configurations as well as methods for loading/downloading/saving configurations.
Each subclass has its own parameters. For instance, BERT pretrained models have BertConfig.
This is the configuration class to store the configuration of a BertModel or a TFBertModel. It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BERT bert-base-uncased architecture.
For instance, the num_labels parameter is from PretrainedConfig:
num_labels (int, optional) – Number of labels to use in the last layer added to the model, typically for a classification task.
TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
The configuration file for the model bert-base-uncased is published at Huggingface model - bert-base-uncased - config.json.
{
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.6.0.dev0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
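As a hedged sketch of how these configuration parameters can be used explicitly, the config can be loaded and overridden before instantiating the model (the num_labels value is illustrative):
from transformers import BertConfig, TFBertForSequenceClassification

config = BertConfig.from_pretrained('bert-base-uncased', num_labels=3)   # 3 is illustrative
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)
print(config.num_labels)   # 3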
There are a couple of examples provided by Huggingface for fine-tuning on your own custom datasets. For instance, utilize the Sequence Classification capability of BERT for text classification.
This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets.
How to fine-tune a pretrained model from the Transformers library. In TensorFlow, models can be directly trained using Keras and the fit method.
However, the examples in the documentation are overviews and lack detailed information.
Fine-tuning with native PyTorch/TensorFlow
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
The GitHub repository provides complete code.
This folder contains some scripts showing examples of text classification with the 🤗 Transformers library.
run_text_classification.py is the example for text classification fine-tuning for TensorFlow.
However, it is neither simple nor straightforward, as it is intended for generic, all-purpose usage. Hence there is not a good example for people to get started with, causing situations where people need to raise questions like this one.
You will see transfer learning (fine-tuning) articles explaining how to add classification layers on top of the pre-trained base models, as was done in the answer above.
output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)
However, the Huggingface example in the documentation does not add any classification layers.
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
This is because TFBertForSequenceClassification has already added the layers.
the base DistilBERT model without any specific head on top (as opposed to other classes such as TFDistilBertForSequenceClassification that do have an added classification head).
If you show the Keras model summary, for instance for TFDistilBertForSequenceClassification, it shows the Dense and Dropout layers added on top of the base BERT model.
Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
_________________________________________________________________
pre_classifier (Dense) multiple 590592
_________________________________________________________________
classifier (Dense) multiple 1538
_________________________________________________________________
dropout_59 (Dropout) multiple 0
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
There are a few discussions, e.g. Fine Tune BERT Models, but apparently the Huggingface way is not to freeze the base model parameters, as shown in the Keras model summary above: Non-trainable params: 0.
To freeze the base distilbert layer:
for _layer in model.layers:   # iterate over the model's layers, not the model itself
    if _layer.name == 'distilbert':
        print(f"Freezing model layer {_layer.name}")
        _layer.trainable = False
    print(_layer.name)
    print(_layer.trainable)
---
Freezing model layer distilbert
distilbert
False <----------------
pre_classifier
True
classifier
True
dropout_99
True
Another resource to look into is Kaggle. Search with the keywords "huggingface" and "BERT" and you will find working code published for the competitions.
Upvotes: 0
Reputation: 2879
There are really not many good examples of HuggingFace transformers with custom dataset files.
Let's import the required libraries first:
import numpy as np
import pandas as pd
import sklearn.model_selection as ms
import sklearn.preprocessing as p
import tensorflow as tf
import transformers as trfs
And define the needed constants:
# Max length of encoded string (including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64
# Standard BERT model with lowercase chars only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased'
# Batch size for fitting:
BATCH_SIZE = 16
# Number of epochs:
EPOCHS = 5
Now it's time to read the dataset:
df = pd.read_csv('data.csv')
Then define the required model from pretrained BERT for sequence classification:
def create_model(max_sequence, model_name, num_labels):
    bert_model = trfs.TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    # This is the input for the tokens themselves (words from the dataset after encoding):
    input_ids = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='input_ids')

    # attention_mask is a binary mask which tells BERT which tokens to attend to and which not to attend to.
    # The encoder adds 0 pad tokens to any sequence shorter than MAX_SEQUENCE_LENGTH,
    # and attention_mask, in this case, tells BERT which tokens come from the original data and which are 0 pad tokens:
    attention_mask = tf.keras.layers.Input((max_sequence,), dtype=tf.int32, name='attention_mask')

    # Use the previous inputs as BERT inputs:
    output = bert_model([input_ids, attention_mask])[0]

    # We can also add dropout as a regularization technique:
    #output = tf.keras.layers.Dropout(rate=0.15)(output)

    # Provide the number of classes to the final layer:
    output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

    # Final model:
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
    return model
Now we need to instantiate the model using the defined function and compile it:
model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, df.category_index.nunique())
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])  # labels are integer category indices, so use the sparse loss
Create a function for the tokenization (converting text to tokens):
def batch_encode(X, tokenizer):
    return tokenizer.batch_encode_plus(
        X,
        max_length=MAX_SEQUENCE_LENGTH, # set the length of the sequences
        add_special_tokens=True,        # add [CLS] and [SEP] tokens
        return_attention_mask=True,
        return_token_type_ids=False,    # not needed for this type of ML task
        pad_to_max_length=True,         # add 0 pad tokens to the sequences less than max_length
        return_tensors='tf'
    )
Load the tokenizer:
tokenizer = trfs.BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)
Split the data into train and validation parts:
X_train, X_val, y_train, y_val = ms.train_test_split(df.text.values, df.category_index.values, test_size=0.2)
Encode our sets:
X_train = batch_encode(list(X_train), tokenizer)  # pass the tokenizer; batch_encode expects it as the second argument
X_val = batch_encode(list(X_val), tokenizer)
Finally, we can fit our model using the train set and validate after each epoch using the validation set:
model.fit(
x=X_train.values(),
y=y_train,
validation_data=(X_val.values(), y_val),
epochs=EPOCHS,
batch_size=BATCH_SIZE
)
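After training, inference can reuse the same batch_encode function; a minimal sketch (the sample texts are made up):
# Encode new texts with the same tokenizer and predict the category probabilities:
new_texts = ['Barcelona Football Jersey Home', 'Reusable Rubber Hand Gloves']
encoded = batch_encode(new_texts, tokenizer)
probs = model.predict([encoded['input_ids'], encoded['attention_mask']])
print(np.argmax(probs, axis=-1))   # predicted category indices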
Upvotes: 7
Reputation: 1706
You need to transform your input data into the tf.data format with the expected schema, so you can first create the features and then train your classification model.
If you look at the glue datasets which come with tensorflow_datasets (link), you will see that the data have a specific schema:
dataset_ops.get_legacy_output_classes(data['train'])
{'idx': tensorflow.python.framework.ops.Tensor,
'label': tensorflow.python.framework.ops.Tensor,
'sentence': tensorflow.python.framework.ops.Tensor}
Such a schema is expected if you want to use convert_examples_to_features to prepare the data to be injected into your model.
Transforming the data is not as straightforward as with pandas, for example, and it will heavily depend on the structure of your input data.
For example, you can find here a step-by-step guide to do such a transformation. This can be done using tf.data.Dataset.from_generator.
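For illustration only, here is a minimal sketch of producing that schema from a pandas dataframe with tf.data.Dataset.from_generator; the column names follow the question's dataframe and are assumptions:
import pandas as pd
import tensorflow as tf

df = pd.read_csv('train.csv')   # assumed to contain 'text' and 'category_index' columns

def gen():
    for idx, row in df.iterrows():
        yield {'idx': idx, 'label': row['category_index'], 'sentence': row['text']}

dataset = tf.data.Dataset.from_generator(
    gen,
    output_types={'idx': tf.int64, 'label': tf.int64, 'sentence': tf.string},
)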
Upvotes: 0