Reputation: 222
I want to use one of the models for sequence classification provided by Hugging Face. It seems they provide a function called
glue_convert_examples_to_features()
for preparing the data so that it can be fed into the models.
However, this conversion function seems to apply only to the GLUE datasets, and I can't find an easy way to apply the conversion to my custom data. Am I overlooking a prebuilt function like the one above? What would be an easy way to convert my custom data, with one sequence and two labels, into the format the model expects?
Upvotes: 2
Views: 570
Reputation: 181
Hugging Face added a fine-tuning with custom datasets guide that contains a lot of useful information. Using the IMDB sequence classification section, I was able to adapt a notebook that used a GLUE dataset to work with my own pandas DataFrame.
from transformers import (
    AutoConfig,
    AutoTokenizer,
    TFAutoModelForSequenceClassification
)
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
df = pd.read_pickle('data.pkl')
train_texts = df.text.values # an array of strings
train_labels = df.label.values # an array of integers
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)
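# The tokenizer itself does what glue_convert_examples_to_features() did:
# it returns input_ids, attention_mask (and token_type_ids for BERT)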
train_encodings = tokenizer(train_texts.tolist(), truncation=True, max_length=96, padding=True)
val_encodings = tokenizer(val_texts.tolist(), truncation=True, max_length=96, padding=True)
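# Wrap encodings and labels in tf.data Datasets; dict() exposes
# input_ids / attention_mask as the named inputs the model expects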
train_dataset = tf.data.Dataset.from_tensor_slices((
dict(train_encodings),
train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
dict(val_encodings),
val_labels
))
# Hyperparameters; the batch sizes must be defined before the datasets
# are batched below, otherwise .batch(train_batch_size) raises a NameError
num_labels = 3
learning_rate = 2e-5
train_batch_size = 8
eval_batch_size = 8
num_epochs = 1

num_train_examples = len(train_dataset)
num_dev_examples = len(val_dataset)
train_steps_per_epoch = int(num_train_examples / train_batch_size)
dev_steps_per_epoch = int(num_dev_examples / eval_batch_size)

train_dataset = train_dataset.shuffle(100).batch(train_batch_size)
val_dataset = val_dataset.shuffle(100).batch(eval_batch_size)
config = AutoConfig.from_pretrained(model_name, num_labels=num_labels)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, config=config)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
history = model.fit(train_dataset,
epochs=num_epochs,
steps_per_epoch=train_steps_per_epoch,
validation_data=val_dataset,
validation_steps=dev_steps_per_epoch)
Notebook credits: digitalepidemiologylab covid-twitter-bert colab
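After training, you can classify new texts with the same tokenizer. A minimal sketch (the example text is made up; on older transformers versions model(...) returns a tuple, so use outputs[0] instead of outputs.logits):
# Tokenize new texts exactly like the training data
new_texts = ['an example sentence to classify']
inputs = tokenizer(new_texts, truncation=True, max_length=96,
                   padding=True, return_tensors='tf')

# Forward pass, then argmax over the num_labels logits
outputs = model(inputs)
pred_label_ids = tf.argmax(outputs.logits, axis=-1).numpy()  # integer class ids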
Upvotes: 1