Reputation: 51
I'm pretraining a transformer with my own unlabeled data like this:
python train_mlm.py sentence-transformers/LaBSE train.txt
Based on https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/MLM
Then I want to get embeddings for sentences. Code:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained('output/sentence-transformers_LaBSE-2021-12-28_13-03-20')
tokenizer = AutoTokenizer.from_pretrained('output/sentence-transformers_LaBSE-2021-12-28_13-03-20')
model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]

encoded_input = tokenizer(english_sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
print(model_output[0].shape)
The problem is that the shape of my output is something like (3, 14, 500 000), whereas without training on my data the shape is (3, 14, 768). What have I done wrong? How can I get the final embeddings after my training?
Upvotes: 0
Views: 909
Reputation: 148
You pre-trained a transformer with Masked Language Modeling (MLM). That does not mean you have to keep using the MLM head afterwards (AutoModelForMaskedLM.from_pretrained), because your downstream task is actually embedding generation for some inputs. This is achieved by loading only the base encoder of your fine-tuned model, without any head on top: AutoModel.from_pretrained(...). This will return the output shapes you are expecting.
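A minimal sketch of what that could look like (the checkpoint path is the one from your question; mean pooling over the token embeddings is just one common way to turn per-token hidden states into sentence embeddings, not necessarily the pooling LaBSE itself was trained with):

import torch
from transformers import AutoModel, AutoTokenizer

# Load only the base encoder (no MLM head) from the fine-tuned checkpoint
model = AutoModel.from_pretrained('output/sentence-transformers_LaBSE-2021-12-28_13-03-20')
tokenizer = AutoTokenizer.from_pretrained('output/sentence-transformers_LaBSE-2021-12-28_13-03-20')
model = model.eval()

sentences = ["dog", "Puppies are nice."]
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')

with torch.no_grad():
    model_output = model(**encoded_input)

# last_hidden_state has shape (B, L, 768)
token_embeddings = model_output.last_hidden_state

# One common option: mean pooling over non-padding tokens to get one vector per sentence
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # (B, 768)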
More clarification on the output shapes you are getting, with B = batch size and L = batch sequence length:
AutoModel provides the encoder's last hidden states, of shape (B, L, 768) for LaBSE.
AutoModelForMaskedLM provides token logits over the whole vocabulary, of shape (B, L, vocab_size), which is where your ~500 000 comes from.
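If it helps, here is a small sketch (same checkpoint path from your question) that prints both shapes side by side:

import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoTokenizer

path = 'output/sentence-transformers_LaBSE-2021-12-28_13-03-20'
tokenizer = AutoTokenizer.from_pretrained(path)
inputs = tokenizer(["Puppies are nice."], return_tensors='pt')

with torch.no_grad():
    encoder_out = AutoModel.from_pretrained(path)(**inputs)      # base encoder only
    mlm_out = AutoModelForMaskedLM.from_pretrained(path)(**inputs)  # encoder + MLM head

print(encoder_out.last_hidden_state.shape)  # (B, L, 768) -> embeddings
print(mlm_out.logits.shape)                 # (B, L, vocab_size) -> MLM token predictions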
Upvotes: 1