PeCaDe

Reputation: 406

setfit training with a pandas dataframe

I would like to train a zero shot classifier on an annotated sample dataset.

I am following some tutorials, but since they all use their own data and the same pretrained model, I would like to confirm: is this the best approach?

Data example: 

import pandas as pd
from datasets import Dataset
    
# Sample feedback data, it will have 8 samples per label
feedback_dict = [
    {'text': 'The product is great and works well.', 'label': 'Product Performance'},
    {'text': 'I love the design of the product.', 'label': 'Product Design'},
    {'text': 'The product is difficult to use.', 'label': 'Usability'},
    {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
    {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]

# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)

# convert to Dataset format
df = Dataset.from_pandas(df)

Given the data format above, this is my approach for fine-tuning the model:

from setfit import SetFitModel, SetFitTrainer

# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# training with Setfit
trainer = SetFitTrainer(
    model=model,
    train_dataset=df, # to keep the code simple I do not create the df_train
    eval_dataset=df, # to keep the code simple I do not create the df_eval
    column_mapping={"text": "text", "label": "label"} 
)

trainer.train()

The issue is that the process never finishes, even after more than 500 hours on a laptop, and the dataset is only about 88 records with 11 labels.

Upvotes: 4

Views: 1138

Answers (2)

Maciej Skorski

Reputation: 3354

There is nothing wrong with your code, but you need a more powerful machine, ideally one with a GPU, to train Transformers. They are not for the poor :-) Try Colab or Kaggle for free, or a private VM if you have the chance. There, a few epochs take only a few seconds.

I am sharing a Colab Notebook here, and this is what the performance and resource usage look like:

[Screenshot: Colab performance and resource usage]

My advice would be to use free Kaggle Notebooks with a GPU: they are slower than Colab (by a factor of about 4x in my experience) but more generous in terms of availability and time limits. Here is the Kaggle Notebook too, for comparison and experimentation.
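Before starting a long run, it can help to confirm that the notebook actually sees a GPU. Here is a minimal sketch, assuming PyTorch is the backend (it is installed alongside setfit); the helper name `pick_device` is my own and not part of any library:

```python
def pick_device():
    """Return 'cuda' if PyTorch sees a GPU, 'cpu' if it does not,
    or None if PyTorch is not installed in this environment."""
    try:
        import torch  # pulled in as a dependency of setfit / sentence-transformers
    except ImportError:
        return None
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training would run on: {device}")
```

If this prints `cpu` on Colab or Kaggle, the GPU accelerator is probably not enabled in the notebook settings.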

Happy GPU training!

Upvotes: -1

SilentCloud

Reputation: 1985

I ran the example you posted on Google Colab, and the training took 37 seconds.

Here's your code with a few tweaks to make it work on Colab:

%%capture
# Install libraries (the %%capture cell magic must be the first line of the cell)
!pip install datasets setfit

After installing the libraries, run the following code:

### Import dataset
import pandas as pd
from datasets import Dataset
# Sample feedback data, it will have 8 samples per label
feedback_dict = [
    {'text': 'The product is great and works well.', 'label': 'Product Performance'},
    {'text': 'I love the design of the product.', 'label': 'Product Design'},
    {'text': 'The product is difficult to use.', 'label': 'Usability'},
    {'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
    {'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]
# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)
# convert to Dataset format
df = Dataset.from_pandas(df)

### Run training
from setfit import SetFitModel, SetFitTrainer
# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
# training with Setfit
trainer = SetFitTrainer(
    model=model,
    train_dataset=df, # to keep the code simple I do not create the df_train
    eval_dataset=df, # to keep the code simple I do not create the df_eval
    column_mapping={"text": "text", "label": "label"} 
)
trainer.train()
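The snippet above reuses the same data for training and evaluation to keep things short. Here is a minimal sketch of how a per-label hold-out could look, assuming records shaped like `feedback_dict`; the helper `stratified_split`, the toy records, and the choice of 2 evaluation examples per label are hypothetical:

```python
import random
from collections import defaultdict


def stratified_split(records, eval_per_label=2, seed=42):
    """Hold out up to `eval_per_label` examples of each label for evaluation."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)
    train, evaluation = [], []
    for items in by_label.values():
        items = items[:]  # copy so the grouped lists are not shuffled in place
        rng.shuffle(items)
        # Keep at least one training example per label
        n_eval = max(0, min(eval_per_label, len(items) - 1))
        evaluation.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, evaluation


# Toy data: 8 examples for each of 2 made-up labels
records = [{"text": f"sample {i}", "label": f"label {i % 2}"} for i in range(16)]
train_records, eval_records = stratified_split(records)
print(len(train_records), len(eval_records))  # 12 4
```

Each list can then be converted the same way as before, e.g. `Dataset.from_pandas(pd.DataFrame(train_records))`, and passed as `train_dataset` and `eval_dataset`.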

Finally, you can save the trained model to Google Drive and then download it to your PC manually.

### Download model to drive
from google.colab import drive
drive.mount('/content/drive')
# save_pretrained is the public wrapper around the private _save_pretrained hook
trainer.model.save_pretrained('/content/drive/path/to/target/folder')

If your main issue is the training time, this should fix it.
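To see why this job should be small, here is a rough back-of-the-envelope sketch. It assumes the `SetFitTrainer` defaults I believe apply here (`num_iterations=20`, `batch_size=16`, `num_epochs=1`) and that each iteration builds one positive and one negative pair per sample; check these against your installed version. The function name `estimate_setfit_steps` is hypothetical:

```python
import math


def estimate_setfit_steps(n_samples, num_iterations=20, batch_size=16, num_epochs=1):
    """Estimate contrastive pairs and optimizer steps for the embedding fine-tuning phase.

    Assumption: one positive and one negative pair per sample per iteration.
    """
    pairs = 2 * num_iterations * n_samples
    steps = math.ceil(pairs / batch_size) * num_epochs
    return pairs, steps


pairs, steps = estimate_setfit_steps(88)  # 88 records, as in the question
print(pairs, steps)  # 3520 220
```

A couple of hundred steps on short sentences is a matter of seconds to minutes on a GPU, so a run of hundreds of hours points at the hardware or the environment rather than the dataset size.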

Upvotes: 4
