Amit

Reputation: 6304

Train Model with Token Features

I want to train a BERT-like model for Hebrew, where for every word I know:

  1. Lemma
  2. Gender
  3. Number
  4. Voice

And I would like to train a model where, for each token, the embeddings of these features are concatenated: Embedding(Token) = E1(Lemma) : E2(Gender) : E3(Number) : E4(Voice)

Is there a way to do such a thing with the current huggingface transformers library?

Upvotes: 1

Views: 113

Answers (1)

Jindřich

Reputation: 11213

Models in Huggingface's Transformers do not support factored inputs by default. As a workaround, you can embed the inputs yourself and bypass the embedding layer in BERT. Instead of providing input_ids when you call the model, you can provide inputs_embeds. The model will use the provided embeddings and add the position embeddings to them. Note that the provided embeddings need to have the same dimension as the rest of the model.

You need one embedding layer per input type (lemma, gender, number, voice), which also means having factor-specific vocabularies that assign indices to the inputs for the embedding lookup. It makes sense to use a larger embedding for lemmas than for the grammatical categories, which have only a few possible values.
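
For instance, a minimal sketch of such factor-specific embedding layers in PyTorch (the vocabulary sizes and embedding dimensions below are made-up placeholders; in practice they come from your own factor vocabularies):

    import torch
    import torch.nn as nn

    # Hypothetical vocabulary sizes built from the training corpus.
    LEMMA_VOCAB, GENDER_VOCAB, NUMBER_VOCAB, VOICE_VOCAB = 50_000, 4, 4, 4

    # A large embedding for lemmas, small ones for the closed grammatical categories.
    # 576 + 3 * 64 = 768, i.e. the hidden size of bert-base, so in this particular
    # setup no extra projection is strictly needed.
    lemma_emb  = nn.Embedding(LEMMA_VOCAB,  576)
    gender_emb = nn.Embedding(GENDER_VOCAB,  64)
    number_emb = nn.Embedding(NUMBER_VOCAB,  64)
    voice_emb  = nn.Embedding(VOICE_VOCAB,   64)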

Then you just concatenate the embeddings, optionally project them, and feed them as inputs_embeds to the model.
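
Continuing the sketch above (the projection layer and the dummy inputs are assumptions for illustration, not a fixed recipe):

    from transformers import BertConfig, BertModel

    config = BertConfig()                  # hidden_size defaults to 768
    bert = BertModel(config)

    # Optional projection from the concatenated size to the model's hidden size.
    proj = nn.Linear(576 + 64 + 64 + 64, config.hidden_size)

    def embed(lemma_ids, gender_ids, number_ids, voice_ids):
        # Concatenate the factor embeddings along the feature dimension.
        concatenated = torch.cat(
            [lemma_emb(lemma_ids), gender_emb(gender_ids),
             number_emb(number_ids), voice_emb(voice_ids)], dim=-1)
        return proj(concatenated)

    # Dummy factored inputs: a batch of 2 sequences, 8 tokens each.
    shape = (2, 8)
    lemma_ids  = torch.randint(0, LEMMA_VOCAB,  shape)
    gender_ids = torch.randint(0, GENDER_VOCAB, shape)
    number_ids = torch.randint(0, NUMBER_VOCAB, shape)
    voice_ids  = torch.randint(0, VOICE_VOCAB,  shape)

    inputs_embeds = embed(lemma_ids, gender_ids, number_ids, voice_ids)
    outputs = bert(inputs_embeds=inputs_embeds)   # bypasses BERT's own word embeddings
    print(outputs.last_hidden_state.shape)        # torch.Size([2, 8, 768])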

Upvotes: 1
