Reputation: 890
I am working on a text classification problem, for which I am trying to train my model on TFBertForSequenceClassification given in the huggingface-transformers library.
I followed the example given on their github page, and I am able to run the sample code with the given sample data using tensorflow_datasets.load('glue/mrpc').
However, I am unable to find an example of how to load my own custom data and pass it to
model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=valid_dataset, validation_steps=7).
How can I define my own X, tokenize it, and prepare train_dataset with my X and Y, where X represents my input text and Y represents the classification category of a given X?
Sample training dataframe:
text category_index
0 Assorted Print Joggers - Pack of 2 ,/ Gray Pri... 0
1 "Buckle" ( Matt ) for 35 mm Width Belt 0
2 (Gagam 07) Barcelona Football Jersey Home 17 1... 2
3 (Pack of 3 Pair) Flocklined Reusable Rubber Ha... 1
4 (Summer special Offer)Firststep new born baby ... 0
Upvotes: 12
Views: 12377
Reputation: 22326
There are multiple approaches to fine-tune BERT for the target tasks.
Note that the BERT base model has been pre-trained only on two tasks, as stated in the original paper.
3.1 Pre-training BERT ...we pre-train BERT using two unsupervised tasks
- Task #1: Masked LM
- Task #2: Next Sentence Prediction (NSP)
Hence, the base BERT model is half-baked and can be fully baked for the target domain (1st approach). Alternatively, we can use it as part of our custom model training with the base trainable (2nd approach) or frozen (3rd approach).
How to Fine-Tune BERT for Text Classification? demonstrated the 1st approach of further pre-training, and pointed out that the learning rate is the key to avoiding catastrophic forgetting, where the pre-trained knowledge is erased while learning new knowledge.
We find that a lower learning rate, such as 2e-5, is necessary to make BERT overcome the catastrophic forgetting problem. With an aggressive learn rate of 4e-4, the training set fails to converge.
Probably this is the reason why the BERT paper used 5e-5, 4e-5, 3e-5, and 2e-5 for fine-tuning.
We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set
Note that the base model pre-training itself used a higher learning rate.
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, β1=0.9 and β2=0.999, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.
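As a hedged sketch (not from the paper), a fine-tuning learning-rate schedule with warmup and linear decay can be built with the create_optimizer helper, assuming it is available in your transformers version; the step counts below are illustrative:
from transformers import create_optimizer

# Illustrative values: a small fine-tuning learning rate in the paper's range (5e-5 ... 2e-5)
# with a warmup of roughly 10% of the training steps, then linear decay.
num_train_steps = 1000   # e.g. steps_per_epoch * num_epochs for your dataset
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    num_warmup_steps=int(0.1 * num_train_steps),
    weight_decay_rate=0.01,
)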
The 1st approach will be described as part of the 3rd approach below.
FYI: TFDistilBertModel is the bare base model with the name distilbert.
Model: "tf_distil_bert_model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
=================================================================
Total params: 66,362,880
Trainable params: 66,362,880
Non-trainable params: 0
Huggingface takes the 2nd approach, as in Fine-tuning with native PyTorch/TensorFlow, where TFDistilBertForSequenceClassification adds the custom classification layer classifier on top of the base distilbert model, which remains trainable. The small learning rate requirement applies here as well to avoid catastrophic forgetting.
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
_________________________________________________________________
pre_classifier (Dense) multiple 590592
_________________________________________________________________
classifier (Dense) multiple 1538
_________________________________________________________________
dropout_59 (Dropout) multiple 0
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010 <--- All parameters are trainable
Non-trainable params: 0
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
DistilBertTokenizerFast,
TFDistilBertForSequenceClassification,
)
DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512
LEARNING_RATE = 5e-5
BATCH_SIZE = 16
NUM_EPOCHS = 3
# --------------------------------------------------------------------------------
# Tokenizer
# --------------------------------------------------------------------------------
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )
# --------------------------------------------------------------------------------
# Load data
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
raw_train[DATA_COLUMN].tolist(),
raw_train[LABEL_COLUMN].tolist(),
test_size=.2,
shuffle=True
)
NUM_LABELS = raw_train[LABEL_COLUMN].nunique()  # Number of target categories for the classification head
# --------------------------------------------------------------------------------
# Prepare TF dataset
# --------------------------------------------------------------------------------
train_dataset = tf.data.Dataset.from_tensor_slices((
dict(tokenize(train_data)), # Convert BatchEncoding instance to dictionary
train_label
)).shuffle(1000).batch(BATCH_SIZE).prefetch(1)
validation_dataset = tf.data.Dataset.from_tensor_slices((
dict(tokenize(validation_data)),
validation_label
)).batch(BATCH_SIZE).prefetch(1)
# --------------------------------------------------------------------------------
# training
# --------------------------------------------------------------------------------
model = TFDistilBertForSequenceClassification.from_pretrained(
'distilbert-base-uncased',
num_labels=NUM_LABELS
)
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(
optimizer=optimizer,
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(
    x=train_dataset,                      # tf.data.Dataset yielding (features, label); already batched
    y=None,                               # labels are supplied by the dataset
    validation_data=validation_dataset,
    epochs=NUM_EPOCHS,                    # do not pass batch_size when x is a tf.data.Dataset
)
Please note that the images are taken from A Visual Guide to Using BERT for the First Time and modified.
The tokenizer generates an instance of BatchEncoding, which can be used like a Python dictionary and serves as the input to the BERT model.
Holds the output of the encode_plus() and batch_encode() methods (tokens, attention_masks, etc).
This class is derived from a python dictionary and can be used as a dictionary. In addition, this class exposes utility methods to map from word/character space to token space.
Parameters
- data (dict) – Dictionary of lists/arrays/tensors returned by the encode/batch_encode methods (‘input_ids’, ‘attention_mask’, etc.).
The data attribute of the class holds the generated tokens, which have the input_ids and attention_mask elements.
The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.
This argument indicates to the model which tokens should be attended to, and which should not.
If the attention_mask is 0, the token id is ignored. For instance, if a sequence is padded to adjust the sequence length, the padded words should be ignored; hence their attention_mask is 0.
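As a quick, minimal check (the sentence is made up), the tokenizer output is a dictionary-like object whose attention_mask marks real tokens with 1 and pad tokens with 0:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
encoded = tokenizer("a short sentence", padding='max_length', max_length=8, truncation=True)
print(encoded.keys())             # dict_keys(['input_ids', 'attention_mask'])
print(encoded['attention_mask'])  # 1 for real tokens, 0 for [PAD] positions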
BertTokenizer adds special tokens, enclosing a sequence with [CLS] and [SEP]. [CLS] represents Classification and [SEP] separates sequences. For Question Answering or Paraphrase tasks, [SEP] separates the two sentences to compare.
- cls_token (str, optional, defaults to "[CLS]") – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- sep_token (str, optional, defaults to "[SEP]") – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
A Visual Guide to Using BERT for the First Time shows the tokenization.
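For example, converting the generated input_ids back to tokens makes the added special tokens visible (a minimal sketch, reusing one of the sample product titles):
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
encoded = tokenizer("Barcelona Football Jersey Home")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# The token list starts with [CLS] and ends with [SEP].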
The embedding vector for [CLS] in the output from the base model's final layer represents the classification learned by the base model. Hence, feed the embedding vector of the [CLS] token into the classification layer added on top of the base model.
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.
The model structure is illustrated below.
In the model distilbert-base-uncased, each token is embedded into a vector of size 768. The shape of the output from the base model is (batch_size, max_sequence_length, embedding_vector_size=768). This accords with the BERT paper's BERT/BASE model (as indicated in distilbert-base-uncased).
BERT/BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT/LARGE (L=24, H=1024, A=16, Total Parameters=340M).
TFDistilBertModel class to instantiate the base DistilBERT model without any specific head on top (as opposed to other classes such as TFDistilBertForSequenceClassification that do have an added classification head).
We do not want any task-specific head attached because we simply want the pre-trained weights of the base model to provide a general understanding of the English language, and it will be our job to add our own classification head during the fine-tuning process in order to help the model distinguish between toxic comments.
TFDistilBertModel generates an instance of TFBaseModelOutput whose last_hidden_state parameter is the output from the model's last layer.
TFBaseModelOutput([(
'last_hidden_state',
<tf.Tensor: shape=(batch_size, sequence_length, 768), dtype=float32, numpy=array([[[...]]], dtype=float32)>
)])
Parameters
- last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
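Before the full code for the 3rd approach, here is a minimal sketch (made-up sentence) of inspecting last_hidden_state from the bare base model:
from transformers import DistilBertTokenizerFast, TFDistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
base = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

tokens = tokenizer(["a sample sentence"], padding='max_length', max_length=32, truncation=True, return_tensors="tf")
outputs = base(input_ids=tokens['input_ids'], attention_mask=tokens['attention_mask'])
print(outputs.last_hidden_state.shape)   # (1, 32, 768) = (batch_size, sequence_length, hidden_size)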
import datetime

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertModel,
)
TIMESTAMP = datetime.datetime.now().strftime("%Y%b%d%H%M").upper()
DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512                            # Max length allowed for BERT is 512.
raw_train = pd.read_csv("./train.csv")               # Load the data here so the label count can be derived.
NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())   # Number of target categories.
MODEL_NAME = 'distilbert-base-uncased'
NUM_BASE_MODEL_OUTPUT = 768
# Flag to freeze base model
FREEZE_BASE = True
# Flag to add custom classification heads
USE_CUSTOM_HEAD = True
if not USE_CUSTOM_HEAD:
    # Make the base trainable when no classification head exists.
    FREEZE_BASE = False
BATCH_SIZE = 16
LEARNING_RATE = 1e-2 if FREEZE_BASE else 5e-5
L2 = 0.01
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )
The base model expects input_ids and attention_mask, whose shape is (max_sequence_length,). Generate Keras tensors for each of them with an Input layer.
# Inputs for token indices and attention masks
input_ids = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='attention_mask')
Generate the output from the base model. The base model generates a TFBaseModelOutput. Feed the embedding of [CLS] to the next layer.
base = TFDistilBertModel.from_pretrained(
MODEL_NAME,
num_labels=NUM_LABELS
)
# Freeze the base model weights.
if FREEZE_BASE:
    for layer in base.layers:
        layer.trainable = False
base.summary()
# [CLS] embedding is last_hidden_state[:, 0, :]
output = base([input_ids, attention_mask]).last_hidden_state[:, 0, :]
if USE_CUSTOM_HEAD:
    # --------------------------------------------------------------------------------
    # Classification layer 01
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dropout(
        rate=0.15,
        name="01_dropout",
    )(output)
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="01_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="01_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="01_relu"
    )(output)
    # --------------------------------------------------------------------------------
    # Classification layer 02
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="02_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="02_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="02_relu"
    )(output)

# Final softmax layer (always present, regardless of USE_CUSTOM_HEAD)
output = tf.keras.layers.Dense(
    units=NUM_LABELS,
    kernel_initializer='glorot_uniform',
    kernel_regularizer=tf.keras.regularizers.l2(l2=L2),
    activation='softmax',
    name="softmax"
)(output)
name = f"{TIMESTAMP}_{MODEL_NAME.upper()}"
model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output, name=name)
model.compile(
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
metrics=['accuracy']
)
model.summary()
---
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_ids (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
attention_mask (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
tf_distil_bert_model (TFDistilB TFBaseModelOutput(la 66362880 input_ids[0][0]
attention_mask[0][0]
__________________________________________________________________________________________________
tf.__operators__.getitem_1 (Sli (None, 768) 0 tf_distil_bert_model[1][0]
__________________________________________________________________________________________________
01_dropout (Dropout) (None, 768) 0 tf.__operators__.getitem_1[0][0]
__________________________________________________________________________________________________
01_dense_relu_no_regularizer (D (None, 768) 590592 01_dropout[0][0]
__________________________________________________________________________________________________
01_bn (BatchNormalization) (None, 768) 3072 01_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
01_relu (Activation) (None, 768) 0 01_bn[0][0]
__________________________________________________________________________________________________
02_dense_relu_no_regularizer (D (None, 768) 590592 01_relu[0][0]
__________________________________________________________________________________________________
02_bn (BatchNormalization) (None, 768) 3072 02_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
02_relu (Activation) (None, 768) 0 02_bn[0][0]
__________________________________________________________________________________________________
softmax (Dense) (None, 2) 1538 02_relu[0][0]
==================================================================================================
Total params: 67,551,746
Trainable params: 1,185,794
Non-trainable params: 66,365,952 <--- Base BERT model is frozen
# --------------------------------------------------------------------------------
# Split data into training and validation
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
raw_train[DATA_COLUMN].tolist(),
raw_train[LABEL_COLUMN].tolist(),
test_size=.2,
shuffle=True
)
# X = dict(tokenize(train_data))
# Y = tf.convert_to_tensor(train_label)
X = tf.data.Dataset.from_tensor_slices((
dict(tokenize(train_data)), # Convert BatchEncoding instance to dictionary
train_label
)).batch(BATCH_SIZE).prefetch(1)
V = tf.data.Dataset.from_tensor_slices((
dict(tokenize(validation_data)), # Convert BatchEncoding instance to dictionary
validation_label
)).batch(BATCH_SIZE).prefetch(1)
# --------------------------------------------------------------------------------
# Train the model
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
# Input data x can be a dict mapping input names to the corresponding array/tensors,
# if the model has named inputs. Beware of the "names". y should be consistent with x
# (you cannot have Numpy inputs and tensor targets, or inversely).
# --------------------------------------------------------------------------------
history = model.fit(
    x=X,                      # tf.data.Dataset yielding (feature dictionary, label); already batched
    # y=Y,
    y=None,                   # labels are supplied by the dataset
    epochs=NUM_EPOCHS,        # do not pass batch_size when x is a tf.data.Dataset
    validation_data=V,
)
To implement the 1st approach, change the configuration as below.
USE_CUSTOM_HEAD = False
Then FREEZE_BASE is changed to False and LEARNING_RATE is changed to 5e-5, which will run further pre-training on the base BERT model.
For the 3rd approach, saving the model causes issues. The save_pretrained method of the Huggingface model cannot be used, as the model is not a direct subclass of the Huggingface PreTrainedModel. Keras save_model causes an error with the default save_traces=True, or causes a different error with save_traces=False when loading the model with Keras load_model.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-01d66991d115> in <module>()
----> 1 tf.keras.models.load_model(MODEL_DIRECTORY)
11 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/saving/saved_model/load.py in _unable_to_call_layer_due_to_serialization_issue(layer, *unused_args, **unused_kwargs)
865 'recorded when the object is called, and used when saving. To manually '
866 'specify the input shape/dtype, decorate the call function with '
--> 867 '`@tf.function(input_signature=...)`.'.format(layer.name, type(layer)))
868
869
ValueError: Cannot call custom layer tf_distil_bert_model of type <class 'tensorflow.python.keras.saving.saved_model.load.TFDistilBertModel'>, because the call function was not serialized to the SavedModel.Please try one of the following methods to fix this issue:
(1) Implement `get_config` and `from_config` in the layer/model class, and pass the object to the `custom_objects` argument when loading the model. For more details, see: https://www.tensorflow.org/guide/keras/save_and_serialize
(2) Ensure that the subclassed model or layer overwrites `call` and not `__call__`. The input shape and dtype will be automatically recorded when the object is called, and used when saving. To manually specify the input shape/dtype, decorate the call function with `@tf.function(input_signature=...)`.
Only Keras Model save_weights worked as far as I tested.
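A minimal sketch of that workaround, assuming the same model-building code above is re-run before loading (the checkpoint path and the build_model helper are hypothetical):
# Save only the weights; SavedModel serialization of this custom model fails as shown above.
model.save_weights("./model_weights/ckpt")        # hypothetical path

# To restore, rebuild the exact same architecture first, then load the weights.
restored = build_model()                          # hypothetical helper wrapping the model definition above
restored.load_weights("./model_weights/ckpt")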
As far as I tested with the Toxic Comment Classification Challenge, the 1st approach gave better recall (identifying true toxic comments and true non-toxic comments). Code can be accessed as below. Please provide corrections/suggestions if any.
Upvotes: 12
Reputation: 22326
Expanding on the answer from konstantin_doncov.
When instantiating a model, you need to define the model initialization parameters that are defined in the Transformers configuration file. The base class is PretrainedConfig.
Base class for all configuration classes. Handles a few parameters common to all models’ configurations as well as methods for loading/downloading/saving configurations.
Each subclass has its own parameters. For instance, BERT pretrained models have BertConfig.
This is the configuration class to store the configuration of a BertModel or a TFBertModel. It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BERT bert-base-uncased architecture.
For instance, the num_labels parameter is from PretrainedConfig:
num_labels (int, optional) – Number of labels to use in the last layer added to the model, typically for a classification task.
TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
The configuration file for the model bert-base-uncased is published at Huggingface model - bert-base-uncased - config.json.
{
"architectures": [
"BertForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.6.0.dev0",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
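As a hedged sketch of how these configuration parameters can be used explicitly, the config can be loaded and overridden before instantiating the model (the num_labels value is illustrative):
from transformers import BertConfig, TFBertForSequenceClassification

config = BertConfig.from_pretrained('bert-base-uncased', num_labels=3)   # 3 is illustrative
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)
print(config.num_labels)   # 3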
There are a couple of examples provided by Huggingface for fine-tuning on your own custom datasets. For instance, utilize the Sequence Classification capability of BERT for text classification.
This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets.
How to fine-tune a pretrained model from the Transformers library. In TensorFlow, models can be directly trained using Keras and the fit method.
However, the examples in the documentation are overviews and lack detailed information.
Fine-tuning with native PyTorch/TensorFlow
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
The GitHub repository provides complete code.
This folder contains some scripts showing examples of text classification with the 🤗 Transformers library.
run_text_classification.py is the example for text classification fine-tuning for TensorFlow.
However, it is neither simple nor straightforward, as it is intended for generic, all-purpose usage. Hence there is not a good example for people to get started with, causing situations where people need to raise questions like this one.
You will see transfer learning (fine-tuning) articles explaining how to add classification layers on top of the pre-trained base models, as was done in the answer above.
output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)
However, the Huggingface example in the documentation does not add any classification layers.
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
This is because TFBertForSequenceClassification has already added the layers.
the base DistilBERT model without any specific head on top (as opposed to other classes such as TFDistilBertForSequenceClassification that do have an added classification head).
If you show the Keras model summary, for instance for TFDistilBertForSequenceClassification, it shows the Dense and Dropout layers added on top of the base BERT model.
Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
_________________________________________________________________
pre_classifier (Dense) multiple 590592
_________________________________________________________________
classifier (Dense) multiple 1538
_________________________________________________________________
dropout_59 (Dropout) multiple 0
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
There are a few discussions, e.g. Fine Tune BERT Models, but apparently the Huggingface way is not to freeze the base model parameters, as shown in the Keras model summary above: Non-trainable params: 0.
To freeze the base distilbert layer:
for _layer in model.layers:   # iterate over the model's layers, not the model itself
    if _layer.name == 'distilbert':
        print(f"Freezing model layer {_layer.name}")
        _layer.trainable = False
    print(_layer.name)
    print(_layer.trainable)
---
Freezing model layer distilbert
distilbert
False <----------------
pre_classifier
True
classifier
True
dropout_99
True
Another resource to look into is Kaggle. Search with the keywords "huggingface" and "BERT" and you will find working code published for the competitions.
Upvotes: 0
Reputation: 2879
There are really not many good examples of HuggingFace transformers with custom dataset files.
Let's import the required libraries first:
import numpy as np
import pandas as pd
import sklearn.model_selection as ms
import sklearn.preprocessing as p
import tensorflow as tf
import transformers as trfs
And define the needed constants:
# Max length of encoded string (including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64
# Standard BERT model with lowercase chars only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased'
# Batch size for fitting:
BATCH_SIZE = 16
# Number of epochs:
EPOCHS = 5
Now it's time to read the dataset:
df = pd.read_csv('data.csv')
Then define the required model from pretrained BERT for sequence classification:
def create_model(max_sequence, model_name, num_labels):
    bert_model = trfs.TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

    # This is the input for the tokens themselves (words from the dataset after encoding):
    input_ids = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='input_ids')

    # attention_mask is a binary mask which tells BERT which tokens to attend to and which not to attend to.
    # The encoder adds 0 pad tokens to any sequence shorter than MAX_SEQUENCE_LENGTH,
    # and attention_mask, in this case, tells BERT which tokens come from the original data and which are 0 pad tokens:
    attention_mask = tf.keras.layers.Input((max_sequence,), dtype=tf.int32, name='attention_mask')

    # Use the previous inputs as BERT inputs:
    output = bert_model([input_ids, attention_mask])[0]

    # We can also add dropout as a regularization technique:
    #output = tf.keras.layers.Dropout(rate=0.15)(output)

    # Provide the number of classes to the final layer:
    output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

    # Final model:
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
    return model
Now we need to instantiate the model using the defined function and compile it:
model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, df.category_index.nunique())
opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])  # labels are integer category indices, so use the sparse loss
Create a function for the tokenization (converting text to tokens):
def batch_encode(X, tokenizer):
    return tokenizer.batch_encode_plus(
        X,
        max_length=MAX_SEQUENCE_LENGTH, # set the length of the sequences
        add_special_tokens=True,        # add [CLS] and [SEP] tokens
        return_attention_mask=True,
        return_token_type_ids=False,    # not needed for this type of ML task
        pad_to_max_length=True,         # add 0 pad tokens to the sequences less than max_length
        return_tensors='tf'
    )
Load the tokenizer:
tokenizer = trfs.BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)
Split the data into train and validation parts:
X_train, X_val, y_train, y_val = ms.train_test_split(df.text.values, df.category_index.values, test_size=0.2)
Encode our sets:
X_train = batch_encode(list(X_train), tokenizer)  # pass the tokenizer; batch_encode expects it as the second argument
X_val = batch_encode(list(X_val), tokenizer)
Finally, we can fit our model using the train set and validate after each epoch using the validation set:
model.fit(
x=X_train.values(),
y=y_train,
validation_data=(X_val.values(), y_val),
epochs=EPOCHS,
batch_size=BATCH_SIZE
)
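After training, inference can reuse the same batch_encode function; a minimal sketch (the sample texts are made up):
# Encode new texts with the same tokenizer and predict the category probabilities:
new_texts = ['Barcelona Football Jersey Home', 'Reusable Rubber Hand Gloves']
encoded = batch_encode(new_texts, tokenizer)
probs = model.predict([encoded['input_ids'], encoded['attention_mask']])
print(np.argmax(probs, axis=-1))   # predicted category indices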
Upvotes: 7
Reputation: 1706
You need to transform your input data into the tf.data format with the expected schema, so you can first create the features and then train your classification model.
If you look at the glue datasets which come with tensorflow_datasets (link), you will see that the data have a specific schema:
dataset_ops.get_legacy_output_classes(data['train'])
{'idx': tensorflow.python.framework.ops.Tensor,
'label': tensorflow.python.framework.ops.Tensor,
'sentence': tensorflow.python.framework.ops.Tensor}
Such a schema is expected if you want to use convert_examples_to_features to prepare the data to be injected into your model.
Transforming the data is not as straightforward as with pandas, for example, and it will heavily depend on the structure of your input data.
For example, you can find here a step-by-step guide to do such a transformation. This can be done using tf.data.Dataset.from_generator.
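For illustration only, here is a minimal sketch of producing that schema from a pandas dataframe with tf.data.Dataset.from_generator; the column names follow the question's dataframe and are assumptions:
import pandas as pd
import tensorflow as tf

df = pd.read_csv('train.csv')   # assumed to contain 'text' and 'category_index' columns

def gen():
    for idx, row in df.iterrows():
        yield {'idx': idx, 'label': row['category_index'], 'sentence': row['text']}

dataset = tf.data.Dataset.from_generator(
    gen,
    output_types={'idx': tf.int64, 'label': tf.int64, 'sentence': tf.string},
)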
Upvotes: 0