Reputation: 31
I want to fine-tune a pre-trained BERT model. However, my task uses data from a specific domain (say biomedical data). Additionally, my data is in a language other than English (say Dutch).
Now I could fine-tune the Dutch bert-base-dutch-cased pre-trained model. However, how would I go about fine-tuning a biomedical BERT model, like BioBERT, which is in the correct domain but the wrong language?
I have thought about using neural machine translation (NMT), but I don't think it's viable or worth the effort. If I fine-tune without any alterations to the model, I fear that the model will not learn the task well, since it was pre-trained on a completely different language.
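For context, this is roughly the setup I have in mind if I simply go the Dutch route: a minimal sketch with the Hugging Face transformers library, assuming GroNLP/bert-base-dutch-cased as the Hub id, a sentence-classification task, and placeholder text and label count.

```python
# Minimal fine-tuning setup sketch: Dutch BERT + a classification head.
# The model id, label count and example sentence are placeholders.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "GroNLP/bert-base-dutch-cased"  # Dutch BERT (BERTje)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize one Dutch (biomedical-ish) example sentence and run a forward pass.
inputs = tokenizer("De patiënt kreeg antibiotica voorgeschreven.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```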
Upvotes: 1
Views: 2262
Reputation: 1
I don't think that works; even if it does, there is a high chance that BioBERT's tokenizer will split many Dutch words into unknown or meaningless sub-word tokens. So, I'd suggest you try fine-tuning this multilingual BERT model instead: https://huggingface.co/google-bert/bert-base-multilingual-cased
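To see the tokenization issue concretely, you can compare how the two vocabularies split the same Dutch sentence. This is a rough sketch; I'm assuming dmis-lab/biobert-base-cased-v1.1 as BioBERT's Hub id.

```python
# Compare how BioBERT's (English/biomedical) vocabulary and multilingual
# BERT's vocabulary tokenize the same Dutch sentence.
from transformers import AutoTokenizer

dutch_sentence = "De patiënt kreeg antibiotica voorgeschreven."

biobert_tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
mbert_tok = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-cased")

print(biobert_tok.tokenize(dutch_sentence))  # expect heavily fragmented sub-words
print(mbert_tok.tokenize(dutch_sentence))    # expect pieces much closer to whole Dutch words
```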
Upvotes: 0
Reputation: 11
I have never tried this before, but I believe you can apply task-adaptive pretraining (TAPT) to a Dutch BERT model: continue pretraining a Dutch BERT on a small amount of biomedical data in Dutch, so that it augments its general knowledge of Dutch with the specific knowledge needed for your task (the biomedical task you are interested in).
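Here is a rough sketch of what that continued (masked-language-model) pretraining could look like with the Hugging Face Trainer, assuming GroNLP/bert-base-dutch-cased as the Dutch checkpoint and a tiny placeholder corpus standing in for your Dutch biomedical text:

```python
# Task/domain-adaptive pretraining sketch: continue MLM pretraining of a
# Dutch BERT on (placeholder) Dutch biomedical text, then fine-tune later.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "GroNLP/bert-base-dutch-cased"  # assumed Hub id for Dutch BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Placeholder corpus; replace with your Dutch biomedical sentences.
corpus = Dataset.from_dict({"text": [
    "De patiënt kreeg antibiotica voorgeschreven.",
    "Bijwerkingen omvatten misselijkheid en hoofdpijn.",
]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dutch-biomedical-tapt",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # afterwards, fine-tune the adapted checkpoint on the downstream task
```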
Upvotes: 0
Reputation: 31
I just want to know whether there are any methods that allow fine-tuning a pre-trained BERT model that was trained on a specific domain, and using it for data within that same domain but in a different language.
Probably not. BERT's vocabulary is fixed at the start of pre-training, and any vocabulary you add afterwards gets randomly initialized embedding weights, as the sketch below illustrates.
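A small sketch of why that is a problem (assuming dmis-lab/biobert-base-cased-v1.1 as BioBERT's Hub id): you can add Dutch tokens to BioBERT's tokenizer, but the new embedding rows start out random and carry no pretrained knowledge.

```python
# Adding new (e.g. Dutch) tokens to BioBERT: the tokenizer accepts them,
# but the corresponding embedding rows are freshly, randomly initialized.
from transformers import AutoTokenizer, AutoModel

name = "dmis-lab/biobert-base-cased-v1.1"  # assumed Hub id for BioBERT
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

num_added = tokenizer.add_tokens(["voorgeschreven", "misselijkheid"])
model.resize_token_embeddings(len(tokenizer))  # new rows: random initialization
print(f"Added {num_added} tokens; embedding matrix is now "
      f"{model.get_input_embeddings().weight.shape}")
```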
Instead, I would:
Upvotes: 1