PeakyBlinder

Reputation: 1117

Should BERT embeddings be made on tokens or sentences?

I am building a sentence classification model that uses BERT word embeddings. Because the dataset is very large, I combined all the sentences into one string and generated embeddings from the tokens of that string.

s = " ".join(text_list)
len(s)

Here, s is the resulting string and text_list contains the sentences on which I want to build my word embeddings.

I then tokenize the string:

stokens = tokenizer.tokenize(s)

My question is: will BERT perform better when given one whole sentence at a time, or is generating embeddings over 500-token chunks of the whole concatenated string also fine?

Here is the code for my embedding generator:

pool = []          # pooled ([CLS]) output for each chunk
seq_outputs = []   # per-token outputs for each chunk (renamed from "all", which shadows a builtin)
i = 0
while i < 600000:
  chunk = stokens[i:i+500]               # slice the full token list, not a previous slice
  chunk = ["[CLS]"] + chunk + ["[SEP]"]  # add BERT's special tokens
  input_ids = get_ids(chunk, tokenizer, max_seq_length)
  input_masks = get_masks(chunk, max_seq_length)
  input_segments = get_segments(chunk, max_seq_length)
  a, b = embedd(input_ids, input_masks, input_segments)
  pool.append(a)
  seq_outputs.append(b)
  print(i)
  i += 500

What I am essentially doing here is: the tokenized string is about 600,000 tokens long, so I take 500 tokens at a time, generate embeddings for that chunk, and append the pooled output to a list called pool.

Upvotes: 1

Views: 1449

Answers (1)

Ashwin Geet D'Sa

Reputation: 7369

For classification, you don't have to concatenate the sentences. By concatenating them, you are merging sentences of different classes into one sequence.
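For example, here is a minimal sketch of keeping the sentences separate and batch-encoding them. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which appears in the original post, and the example sentences are hypothetical:

import torch
from transformers import BertTokenizer, BertModel

# Hypothetical sentences; in the question these would come from text_list.
text_list = ["A first sentence from class A.", "A second sentence, class B."]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Each sentence is tokenized and padded on its own; no concatenation,
# so sentences of different classes are never merged into one sequence.
enc = tokenizer(text_list, padding=True, truncation=True,
                max_length=128, return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

print(out.last_hidden_state.shape)  # (num_sentences, seq_len, hidden_size)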

If you are fine-tuning BERT, then by default a logistic regression layer is learned on top of the [CLS] token for the classification task. Since BERT is an attention-based transformer model, each token has attended to all the other tokens and has captured the context, so the [CLS] token alone is sufficient.
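As an illustration of that default, here is a hedged sketch of fine-tuning with a classification head over [CLS], using transformers' BertForSequenceClassification rather than the question's TF-Hub-style helpers; num_labels and the example input are assumptions:

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# num_labels=2 is an assumption for a binary sentence-classification task.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

enc = tokenizer(["an example sentence"], return_tensors="pt")
labels = torch.tensor([1])  # hypothetical label

# The head is a linear (logistic-regression-like) layer over the pooled
# [CLS] representation; passing labels makes the model return the loss.
out = model(**enc, labels=labels)
out.loss.backward()  # an optimizer step would follow in a training loop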

However, if you want to use the embeddings directly, you can learn a classifier on a single vector, i.e., the embedding of the [CLS] token or the averaged embeddings of all the tokens. Alternatively, you can take the embedding of each token and feed the resulting sequence to another classifier, such as a CNN or an RNN.
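A sketch of both feature-extraction options, again assuming the transformers library; the mean pooling masks out padding so only real tokens are averaged:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer(["a sentence to embed", "a second one"],
                padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state   # (batch, seq_len, hidden)

# Option 1: one vector per sentence from the [CLS] position.
cls_vec = hidden[:, 0, :]

# Option 2: average all token embeddings, ignoring padding positions.
mask = enc["attention_mask"].unsqueeze(-1).float()
mean_vec = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Either cls_vec or mean_vec can feed a downstream classifier; the full
# hidden sequence could instead be fed to a CNN or RNN.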

Upvotes: 1
