Pablo Messina

Reputation: 441

Pytorch: How to implement nested transformers: a character-level transformer for words and a word-level transformer for sentences?

I have a model in mind, but I'm having a hard time figuring out how to actually implement it in Pytorch, especially when it comes to training the model (e.g. how to define mini-batches, etc.). First of all let me quickly introduce the context:

I'm working on VQA (visual question answering), in which the task is to answer questions about images, for example:

[image: an example image with a natural-language question about it]

So, leaving aside many details, I just want to focus here on the NLP aspect/branch of the model. In order to process the natural language question, I want to use character-level embeddings (instead of traditional word-level embeddings) because they are more robust in the sense that they can easily accommodate morphological variations in words (e.g. prefixes, suffixes, plurals, verb conjugations, hyphens, etc.). But at the same time I don't want to lose the inductive bias of reasoning at the word level. Therefore, I came up with the following design:

[image: diagram of the proposed design: a character-level transformer per word, feeding a word-level transformer over the whole question]

As you can see in the picture above, I want to use transformers (or even better, universal transformers), but with a little twist. I want to use 2 transformers: the first one will process each word's characters in isolation (character-level transformer) to produce an initial word-level embedding for each word in the question. Once we have all these initial word-level embeddings, a second word-level transformer will refine them to enrich their representation with context, thus obtaining context-aware word-level embeddings.

The full model for the whole VQA task is obviously more complex, but I just want to focus here on this NLP part. So my question is basically about which PyTorch functions I should pay attention to when implementing this. For example, since I'll be using character-level embeddings, I have to define a character-level embedding matrix and perform lookups on this matrix to generate the inputs for the character-level transformer, repeat this for each word in the question, and then feed all these vectors into the word-level transformer. Moreover, words in a single question can have different lengths, and questions within a single mini-batch can have different lengths too. So in my code I have to somehow account for different lengths at both the word level and the question level simultaneously within a single mini-batch (during training), and I have no idea how to do that in PyTorch, or whether it's even possible at all.
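To make this concrete, here is roughly what I have in mind for a single, unbatched question. This is only a sketch: the dimensions, the toy character vocabulary and the mean-pooling step are placeholders, positional encodings are omitted, and it assumes a recent PyTorch where nn.TransformerEncoderLayer accepts batch_first=True.

```python
import torch
import torch.nn as nn

# Rough sketch of the intended design for a single question (no batching yet).
d_model, nhead = 128, 4
char_emb = nn.Embedding(30, d_model)   # toy character vocabulary
char_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
word_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)

char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
question = ["what", "color", "is", "the", "umbrella"]

word_vecs = []
for word in question:
    ids = torch.tensor([[char2id[c] for c in word]])   # (1, num_chars)
    states = char_encoder(char_emb(ids))               # character-level transformer
    word_vecs.append(states.mean(dim=1))               # collapse into an initial word embedding
words = torch.cat(word_vecs, dim=0).unsqueeze(0)       # (1, num_words, d_model)
context_aware = word_encoder(words)                    # word-level transformer
print(context_aware.shape)                             # torch.Size([1, 5, 128])
```

What I can't figure out is how to extend something like this to mini-batches in which both the words and the questions have variable lengths (presumably with padding and masking at both levels).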

Any tips on how to go about implementing this in Pytorch that could lead me in the right direction will be deeply appreciated.

Upvotes: 5

Views: 1487

Answers (1)

caspillaga

Reputation: 563

A way to implement what you describe in PyTorch would require adapting the Transformer encoder:

1) Define a custom tokenizer that splits words into characters and embeds them at the character level (instead of using word or word-piece embeddings)

2) Define a mask for each word (similar to the mask the original paper used to hide future tokens in the decoder), so that in the first stage each character is constrained to the context of its own word

3) Then use a traditional Transformer with that mask (effectively restricting attention to word-level context).

4) Then discard the mask and apply the Transformer again (sentence-level context). A rough sketch of this two-pass scheme is given below.
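The following is only a sketch of the masking idea, for a single question and with made-up dimensions; it assumes a recent PyTorch where nn.TransformerEncoderLayer accepts batch_first=True. Batching questions of different lengths would additionally need key-padding masks and the per-head (batch*num_heads, L, L) form of the attention mask.

```python
import torch
import torch.nn as nn

# "is it raining" as one character sequence, plus the word each character belongs to
chars   = list("isitraining")
word_id = torch.tensor([0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 2])

# Stage-1 mask: a character may only attend to characters of the same word.
# In PyTorch's boolean attn_mask convention, True = attention NOT allowed.
word_mask = word_id.unsqueeze(0) != word_id.unsqueeze(1)   # (L, L)

d_model, nhead = 64, 4
char_emb = nn.Embedding(128, d_model)                      # toy character vocabulary
stage1 = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
stage2 = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)

char_ids = torch.tensor([[ord(c) for c in chars]])         # (1, L), toy character encoding
x = char_emb(char_ids)
x = stage1(x, mask=word_mask)   # word-level context only (steps 2 and 3)
x = stage2(x)                   # full sentence-level context (step 4)
print(x.shape)                  # torch.Size([1, 11, 64]) -> still one vector per character
```

Note that the output still has one vector per character, which is exactly point 1 below.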


Things to be careful about:

1) Remember that the Transformer encoder's output length is always the same as its input length (the decoder is the part that can produce longer or shorter sequences). So after your first stage you will not have word-level embeddings (as shown in your diagram) but character-level embeddings. If you want to merge them into word-level embeddings, you will need an additional intermediate decoder step or a custom merging strategy (e.g. a learnt weighted sum, or something similar to BERT's [CLS] token); a sketch of such a pooling step is given after this list.

2) You may face efficiency issues. Remember that the Transformer is O(n^2) in the sequence length, so the longer the sequence, the more computationally expensive it is. In the original Transformer, a sentence of 10 words gives the model a 10-token sequence to deal with. With word-piece embeddings, your model will work with roughly 15-token sequences. But with character-level embeddings, I estimate that you will be dealing with ~50-token sequences, which may not be feasible for long sentences, so you may need to truncate your input (and you would lose all the long-term dependency power of attention models).

3) Are you sure that the character-level Transformer will actually add representational value? The Transformer aims to enrich embeddings based on their context (the surrounding embeddings), which is why the original implementation used word-level embeddings. BERT uses word-piece embeddings to take advantage of regularities among related words, and GPT-2 uses Byte Pair Encoding (BPE), which I don't recommend in your case because it is more suited to next-token prediction. In your case, what information do you think will be captured by the learnt character embeddings that can be effectively shared between the characters of a word? Do you think it will be richer than using a learnt embedding for each word or word-piece? My guess is that this is what you are trying to find out... right?
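Regarding the "learnt weighted sum" in point 1, here is a minimal sketch of one way to do it (score each character state with a small linear layer and softmax the scores into pooling weights). The class name and shapes are just illustrative; an extra [CLS]-like character per word would be another option.

```python
import torch
import torch.nn as nn

class LearnedWordPooling(nn.Module):
    """Collapse the character states of each word into one word embedding
    via a learnt weighted sum (attention-style pooling). Illustrative only."""
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, char_states, char_pad_mask):
        # char_states:   (num_words, max_chars, d_model)
        # char_pad_mask: (num_words, max_chars), True at padded character positions
        scores = self.score(char_states).squeeze(-1)           # (num_words, max_chars)
        scores = scores.masked_fill(char_pad_mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)  # (num_words, max_chars, 1)
        return (weights * char_states).sum(dim=1)              # (num_words, d_model)

# Toy usage: 3 words of up to 5 characters each, model dimension 64
pool = LearnedWordPooling(64)
states = torch.randn(3, 5, 64)
pad = torch.tensor([[False] * 5,
                    [False] * 3 + [True] * 2,
                    [False] * 4 + [True]])
print(pool(states, pad).shape)   # torch.Size([3, 64])
```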

Upvotes: 4
