Thomas Kodill

Reputation: 51

Transformers pretraining with MLM problem - sentence embeddings

I'm pretraining a transformer with my own unlabeled data like this:

python train_mlm.py sentence-transformers/LaBSE train.txt

Based on https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/MLM

Then I want to get embeddings for sentences. Code:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained('output/sentence-transformers_LaBSE-2021-12-28_13-03-20')
tokenizer = AutoTokenizer.from_pretrained('output/sentence-transformers_LaBSE-2021-12-28_13-03-20')

model = model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
encoded_input = tokenizer(english_sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

print(model_output[0].shape)

The problem is that the shape of my output is something like (3, 14, 500000), while without training on my data the shape is (3, 14, 768). What have I done wrong? How can I get the final embeddings after my training?

Upvotes: 0

Views: 909

Answers (1)

Javier Beltrán

Reputation: 148

You pre-trained a transformer on Masked Language Modeling (MLM). That does not mean you have to keep using the MLM head afterwards (AutoModelForMaskedLM.from_pretrained), because your downstream task is actually embedding generation for some inputs. This is achieved by using only the base encoder of your fine-tuned model, without any head on top: AutoModel.from_pretrained(...). This will return the output shapes you are expecting.
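A minimal sketch of that change, reusing the checkpoint path and sentences from the question:

import torch
from transformers import AutoModel, AutoTokenizer

# Load the base encoder (no MLM head) from the fine-tuned checkpoint.
checkpoint = 'output/sentence-transformers_LaBSE-2021-12-28_13-03-20'
model = AutoModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model.eval()

english_sentences = [
    "dog",
    "Puppies are nice.",
    "I enjoy taking long walks along the beach with my dog.",
]
encoded_input = tokenizer(english_sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

# (batch_size, seq_len, 768) instead of (batch_size, seq_len, vocab_size)
print(model_output.last_hidden_state.shape)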

More clarification on the output shapes you are getting, where B is the batch size and L is the sequence length of the batch:

  • (B, L, 768) is the expected output of running a Transformer encoder without a head, i.e. as a generator of embeddings, because 768 is the usual hidden size of Transformer encoders. This is what AutoModel provides.
  • (B, L, 500000) is the expected output of running a Transformer encoder with an MLM head on top, i.e. as a predictor of masked tokens in an MLM task. Something like 500000 is the vocabulary size, and the predicted logits indicate which vocabulary token is most likely to fill each masked position. This is what AutoModelForMaskedLM provides.
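If what you need is a single fixed-size vector per sentence rather than per-token embeddings, you still have to pool over the L dimension. One common convention (an assumption here, not something the training script dictates) is mean pooling with the attention mask; LaBSE itself pools via the [CLS] token, so that is another option. A mean-pooling sketch on top of the AutoModel output above:

# Mean pooling over tokens, ignoring padding; yields one 768-d vector per sentence.
token_embeddings = model_output.last_hidden_state             # (B, L, 768)
mask = encoded_input['attention_mask'].unsqueeze(-1).float()  # (B, L, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)                              # (B, 768)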

Upvotes: 1
