Creating input data for BERT modelling - multiclass text classification

Question

I'm trying to build a Keras model to classify text into 45 different classes. I'm a little confused about how to prepare my data in the input format required by Google's BERT model.

Some blog posts feed the data in as a TensorFlow dataset with input_ids, segment ids, and mask ids, as in this guide, while others use only input_ids and masks, as in this guide.

Also, the second guide notes that the segment mask and attention mask inputs are optional.

Can anyone explain whether or not those two are required for a multiclass classification task?

If it helps, each row of my data can consist of any number of sentences within a reasonably sized paragraph. I want to be able to classify each paragraph/input to a single label.

I can't seem to find many guides or blog posts about using BERT with Keras (TensorFlow 2) for a multiclass problem; many of them cover multi-label problems instead.

Meet · Accepted Answer

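This may be a late answer, but I had the same question. I went through the Hugging Face code and found that if attention_mask and token_type_ids (the segment ids) are None, then by default the model attends to all tokens and every token is given segment id 0. So for single-segment inputs like your paragraphs, leaving out the segment ids is equivalent to passing all zeros.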

If you want to check it out, you can find the code here.
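To make the default behaviour concrete, here is a minimal pure-Python sketch. It is not the Hugging Face code itself: `fill_default_bert_inputs` and `padding_mask` are hypothetical helpers written for illustration, and the token ids are made up apart from 101, 102 and 0, which are [CLS], [SEP] and [PAD] in the bert-base-uncased vocabulary (other vocabularies may differ). It also shows why you would probably still want to pass an explicit attention mask once your paragraphs are padded to a common length, since the all-ones default would attend to the padding too.

```python
from typing import Dict, List, Optional

PAD_ID = 0  # assumption: [PAD] id in bert-base-uncased; check your own tokenizer


def fill_default_bert_inputs(
    input_ids: List[List[int]],
    attention_mask: Optional[List[List[int]]] = None,
    token_type_ids: Optional[List[List[int]]] = None,
) -> Dict[str, List[List[int]]]:
    """Hypothetical helper mirroring the defaults described in the answer."""
    if attention_mask is None:
        # Default: attend to every position, padding included.
        attention_mask = [[1] * len(row) for row in input_ids]
    if token_type_ids is None:
        # Default: every token belongs to segment 0.
        token_type_ids = [[0] * len(row) for row in input_ids]
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


def padding_mask(input_ids: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Build an explicit mask: 1 for real tokens, 0 for padding."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in input_ids]


# Two already-tokenized paragraphs, padded to length 6 (illustrative ids).
batch = [
    [101, 7592, 2088, 102, 0, 0],
    [101, 2023, 2003, 1037, 3231, 102],
]

defaults = fill_default_bert_inputs(batch)
explicit = fill_default_bert_inputs(batch, attention_mask=padding_mask(batch))

print(defaults["attention_mask"][0])  # all ones: padding is attended to
print(explicit["attention_mask"][0])  # zeros over the two padded positions
print(explicit["token_type_ids"][0])  # all zeros: a single segment
```

For single-label classification of one paragraph per row, the segment ids stay all zero either way, so only the attention mask changes anything, and only when the batch contains padding.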

Let me know if this clarifies it or you think otherwise.
