ganto

Reputation: 222

How do I set up a custom input-pipeline for sequence classification for the huggingface transformer models?

I want to use one of the sequence classification models provided by Hugging Face. It seems they provide a function called glue_convert_examples_to_features() for preparing the data so that it can be fed into the models.

However, this conversion function seems to apply only to the GLUE dataset, and I can't find an easy way to apply it to my custom data. Am I overlooking a prebuilt function like the one above? What would be an easy way to convert my custom data, with one sequence and two labels, into the format the model expects?

Upvotes: 2

Views: 570

Answers (1)

chris

Reputation: 181

Hugging Face added a fine-tuning with custom datasets guide that contains a lot of useful information. Using the IMDB sequence classification section of that guide, I was able to adapt a notebook written for a GLUE dataset so that it works with my own pandas dataframe.

from transformers import (
    AutoConfig,
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
)
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

df = pd.read_pickle('data.pkl')

train_texts = df.text.values  # an array of strings
train_labels = df.label.values  # an array of integers

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

train_encodings = tokenizer(train_texts.tolist(), truncation=True, max_length=96, padding=True)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, max_length=96, padding=True)
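
# The tokenizer returns a dict-like BatchEncoding; its keys (for BERT:
# 'input_ids', 'token_type_ids', 'attention_mask') become the feature
# dict that the TF model expects as input.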

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
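
# Optional sanity check (a quick sketch): inspect one example to confirm
# the feature dict and the label line up as expected before batching.
for features, label in train_dataset.take(1):
    print({k: v.shape for k, v in features.items()}, label.numpy())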

num_labels = 3
num_train_examples = len(train_dataset)
num_dev_examples = len(val_dataset)

# Hyperparameters must be defined before the datasets are batched below.
learning_rate = 2e-5
train_batch_size = 8
eval_batch_size = 8
num_epochs = 1

train_dataset = train_dataset.shuffle(100).batch(train_batch_size)
val_dataset = val_dataset.shuffle(100).batch(eval_batch_size)

train_steps_per_epoch = int(num_train_examples / train_batch_size)
dev_steps_per_epoch = int(num_dev_examples / eval_batch_size)

config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, config=config)

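# The TF model outputs raw logits, so the loss uses from_logits=True;
# the sparse loss/metric take integer labels directly (no one-hot needed).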
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]

model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

history = model.fit(train_dataset,
                    epochs=num_epochs,
                    steps_per_epoch=train_steps_per_epoch,
                    validation_data=val_dataset,
                    validation_steps=dev_steps_per_epoch)
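
Once training finishes, you can use the fine-tuned model for predictions. Here is a minimal sketch, assuming a transformers version whose TF models return an output object with a .logits attribute (new_texts is a placeholder list of strings):

import numpy as np

new_texts = ['an example sentence to classify']  # placeholder input
encodings = tokenizer(new_texts, truncation=True, max_length=96,
                      padding=True, return_tensors='tf')
outputs = model(dict(encodings))                 # forward pass, returns logits
probs = tf.nn.softmax(outputs.logits, axis=-1)   # logits -> probabilities
predicted_labels = np.argmax(probs.numpy(), axis=-1)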

Notebook credits: digitalepidemiologylab covid-twitter-bert colab

Upvotes: 1
