Reputation: 1117
I am making a sentence classification model and using BERT word embeddings in it. Because the dataset is very large, I combined all the sentences into one string and generated embeddings from the tokens of that string.
s = " ".join(text_list)
len(s)
Here s is the string and text_list contains the sentences on which I want to make my word embeddings.
I then tokenize the string
stokens = tokenizer.tokenize(s)
My question is: will BERT perform better if it is given one whole sentence at a time, or is it also fine to generate embeddings from fixed-size chunks of tokens taken from the whole concatenated string?
Here is the code for my embedding generator:
pool = []             # pooled ([CLS]) embedding for each 500-token chunk
all_embeddings = []   # per-token embeddings for each 500-token chunk
i = 0
while i != 600000:
    # slice into a new variable so the full token list is not overwritten
    chunk = stokens[i:i+500]
    chunk = ["[CLS]"] + chunk + ["[SEP]"]
    input_ids = get_ids(chunk, tokenizer, max_seq_length)
    input_masks = get_masks(chunk, max_seq_length)
    input_segments = get_segments(chunk, max_seq_length)
    a, b = embedd(input_ids, input_masks, input_segments)
    pool.append(a)
    all_embeddings.append(b)
    print(i)
    i += 500
What I am essentially doing here is: the string gives me about 600,000 tokens, and I take 500 tokens at a time, generate embeddings for them, and append the result to a list called pool.
Upvotes: 1
Views: 1449
Reputation: 7369
For classification, you don't have to concatenate the sentences. By concatenating, you are merging the sentences of different classes.
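As a rough sketch of what keeping the sentences separate could look like, reusing tokenizer, max_seq_length and the get_ids/get_masks/get_segments helpers from your code (their exact signatures are assumed here):
inputs = []
for sentence in text_list:
    # each sentence is tokenized and padded on its own, so classes are never mixed
    tokens = ["[CLS]"] + tokenizer.tokenize(sentence)[:max_seq_length - 2] + ["[SEP]"]
    inputs.append((get_ids(tokens, tokenizer, max_seq_length),
                   get_masks(tokens, max_seq_length),
                   get_segments(tokens, max_seq_length)))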
If you are fine-tuning BERT, then by default a logistic regression layer is learnt on top of the [CLS] token for the classification task. Since it is an attention-based transformer model, each token attends to all the other tokens and captures the context. Thus the [CLS] token is sufficient.
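For illustration, a minimal fine-tuning setup with a recent version of the Hugging Face transformers library (a different interface from the helpers in your code; bert-base-uncased and num_labels=2 are placeholder choices) could look like this:
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# each sentence is encoded separately; the classification head sits on top of [CLS]
enc = tokenizer(text_list, padding=True, truncation=True, max_length=128, return_tensors="pt")
logits = model(**enc).logits   # shape: (num_sentences, num_labels)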
However, if you want to use the embeddings as features, you can learn a classifier on a single vector, i.e., the embedding of the [CLS] token or the averaged embeddings of all the tokens. Alternatively, you can take the embedding of each token and feed the resulting sequence to another classifier such as a CNN or an RNN.
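A sketch of the feature-based route, assuming your loop is run once per sentence so that pool holds one pooled [CLS] vector per sentence, and that labels is a hypothetical array with one class label per sentence:
import numpy as np
from sklearn.linear_model import LogisticRegression

# stack one fixed-size vector per sentence; reshape guards against an extra batch axis
X = np.vstack([np.asarray(vec).reshape(1, -1) for vec in pool])
# the per-token outputs could be averaged over the token axis and used here instead
clf = LogisticRegression(max_iter=1000).fit(X, labels)
predictions = clf.predict(X)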
Upvotes: 1