Amit

Reputation: 6304

Train Model with Token Features

I want to train a BERT-like model for Hebrew, where for every word I know:

  1. Lemma
  2. Gender
  3. Number
  4. Voice

And I would like to train a model where, for each token, the embeddings of these features are concatenated: Embedding(Token) = E1(Lemma) : E2(Gender) : E3(Number) : E4(Voice)

Is there a way to do such a thing with the current huggingface transformers library?

Upvotes: 1

Views: 113

Answers (1)

Jindřich

Reputation: 11213

Models in Huggingface's Transformers do not support factored inputs by default. As a workaround, you can embed the inputs yourself and bypass the embedding layer in BERT. Instead of providing input_ids when you call the model, you can provide inputs_embeds. The model will use the provided embeddings and add the position embeddings to them. Note that the provided embeddings need to have the same dimension as the rest of the model.

You need one embedding layer per input type (lemma, gender, number, voice), which also means having factor-specific vocabularies that assign indices to the inputs for the embedding lookup. It makes sense to use a larger embedding for lemmas than for the grammatical categories, which have only a few possible values.
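
For instance, a minimal sketch of such factor-specific embedding layers in PyTorch (the vocabulary sizes and embedding dimensions below are made-up placeholders; in practice they come from your own factor vocabularies):

    import torch
    import torch.nn as nn

    # Hypothetical vocabulary sizes built from the training corpus.
    LEMMA_VOCAB, GENDER_VOCAB, NUMBER_VOCAB, VOICE_VOCAB = 50_000, 4, 4, 4

    # A large embedding for lemmas, small ones for the closed grammatical categories.
    # 576 + 3 * 64 = 768, i.e. the hidden size of bert-base, so in this particular
    # setup no extra projection is strictly needed.
    lemma_emb  = nn.Embedding(LEMMA_VOCAB,  576)
    gender_emb = nn.Embedding(GENDER_VOCAB,  64)
    number_emb = nn.Embedding(NUMBER_VOCAB,  64)
    voice_emb  = nn.Embedding(VOICE_VOCAB,   64)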

Then you just concatenate the embeddings, optionally project them, and feed them as inputs_embeds to the model.
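
Continuing the sketch above (the projection layer and the dummy inputs are assumptions for illustration, not a fixed recipe):

    from transformers import BertConfig, BertModel

    config = BertConfig()                  # hidden_size defaults to 768
    bert = BertModel(config)

    # Optional projection from the concatenated size to the model's hidden size.
    proj = nn.Linear(576 + 64 + 64 + 64, config.hidden_size)

    def embed(lemma_ids, gender_ids, number_ids, voice_ids):
        # Concatenate the factor embeddings along the feature dimension.
        concatenated = torch.cat(
            [lemma_emb(lemma_ids), gender_emb(gender_ids),
             number_emb(number_ids), voice_emb(voice_ids)], dim=-1)
        return proj(concatenated)

    # Dummy factored inputs: a batch of 2 sequences, 8 tokens each.
    shape = (2, 8)
    lemma_ids  = torch.randint(0, LEMMA_VOCAB,  shape)
    gender_ids = torch.randint(0, GENDER_VOCAB, shape)
    number_ids = torch.randint(0, NUMBER_VOCAB, shape)
    voice_ids  = torch.randint(0, VOICE_VOCAB,  shape)

    inputs_embeds = embed(lemma_ids, gender_ids, number_ids, voice_ids)
    outputs = bert(inputs_embeds=inputs_embeds)   # bypasses BERT's own word embeddings
    print(outputs.last_hidden_state.shape)        # torch.Size([2, 8, 768])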

Upvotes: 1
