Reputation: 47
As far as I understand it, in BERT's operating logic, 50% of the input sentences are changed and the rest are left untouched.
1-) Is the changed part the operation performed with tokenizer.encode? And is this equal to input_ids?
Then padding is done: a matrix is created according to the specified max_len, and the empty positions are filled with 0.
After that, a [CLS] token is placed at the start of each sentence and a [SEP] token is placed at the end.
2-) Is input_mask created in this process?
3-) Also, where do we use input_segment?
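For reference, a minimal sketch of the kind of encoding call I mean (assuming the Hugging Face transformers BertTokenizer; the sentence and max_length are just placeholders):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode_plus adds [CLS]/[SEP], pads to max_length with 0s, and returns the masks
encoded = tokenizer.encode_plus(
    "The quick brown fox jumps over the lazy dog.",
    max_length=16,
    padding="max_length",
    truncation=True,
)
print(encoded["input_ids"])       # token ids, zero-padded to max_length
print(encoded["attention_mask"])  # the input_mask I am asking about
print(encoded["token_type_ids"])  # the input_segment I am asking about
```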
Upvotes: 0
Views: 2125
Reputation: 7369
The input_mask obtained by encoding the sentences does not indicate the presence of [MASK] tokens. Instead, when the sentences in a batch are tokenized, prepended with a [CLS] token, and appended with a [SEP] token, each one ends up with an arbitrary length. To give every sentence in the batch a fixed number of tokens, zero padding is performed. The input_mask then shows whether a given position contains an actual token or is a zero-padded position.
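For example, with the Hugging Face transformers tokenizer (the sentences below are just placeholders), you can see the zero padding and the corresponding mask:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a batch: [CLS]/[SEP] are added and the shorter sentence is zero-padded
batch = tokenizer(
    ["BERT is a language model.", "Padding makes every sentence in the batch the same length."],
    padding=True,
)
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids)   # token ids; trailing 0s are the padded positions
    print(mask)  # the input_mask: 1 for real tokens, 0 for zero-padded positions
```

(In the transformers library the input_mask is returned under the name attention_mask.)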
The [MASK] token is used only if you want to train on the Masked Language Model (MLM) objective.
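As a rough illustration of that objective (a simplified sketch; actual BERT pretraining selects ~15% of the tokens and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged):

```python
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("BERT learns by predicting the masked words.")
# Simplified MLM corruption: replace roughly 15% of the tokens with [MASK]
masked = [t if random.random() > 0.15 else "[MASK]" for t in tokens]
print(masked)  # the model is trained to predict the original tokens at the masked positions
```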
BERT is trained on two objectives, MLM and Next Sentence Prediction (NSP). In NSP, you pass two sentences and try to predict whether the second sentence actually follows the first. segment_id (your input_segment) holds the information of which of the two sentences a particular token belongs to.
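For example (again assuming the Hugging Face tokenizer, where the segment ids are returned as token_type_ids):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair: segment ids are 0 for the first sentence ([CLS] ... [SEP])
# and 1 for the second sentence (... [SEP])
pair = tokenizer("The man went to the store.", "He bought a gallon of milk.")
print(pair["input_ids"])
print(pair["token_type_ids"])  # the segment ids / your input_segment
```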
Upvotes: 1