Reputation: 173
I am searching for a Python library that translates very large texts to English. I have already used TextBlob
(which at some point just stops translating, due to API limits I suppose) and googletrans
(which at some point also just stops translating; it also doesn't handle very large texts, so I have to split them into pieces and then merge the results). I am looking for a solution I can be sure won't stop working, since I will be running this code regularly on around 100K texts with an average length of 10K words. If anyone has done something similar, I would appreciate your help!
Upvotes: 4
Views: 4577
Reputation: 83157
One can use mBART-50 (paper, pre-trained model on Hugging Face); the pre-trained model on Hugging Face is released under the MIT license.
See below for example code. The environment I used on Ubuntu 20.04.5 LTS with an NVIDIA A100 40GB GPU and CUDA 12.0:
conda create --name mbart-python39 python=3.9
conda activate mbart-python39
pip install transformers==4.28.1
pip install chardet==5.1.0
pip install sentencepiece==0.1.99
pip install protobuf==3.20
Code (>95% of the code below is from the Hugging Face documentation):
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
article_fr = "Bonjour comment ca va?"
article_en = "What's going on?"
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
# translate French to English
tokenizer.src_lang = "fr_XX"
encoded_fr = tokenizer(article_fr, return_tensors="pt")
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)
# => "Hello, how are you?"
# translate English to French
tokenizer.src_lang = "en_XX"
encoded_en = tokenizer(article_en, return_tensors="pt")
generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)
# => "Qu'est-ce qui se passe?"
Upvotes: 1
Reputation: 31
The Python library dl-translate gets the job done very well. It is based on Hugging Face transformers, with two available model options: mBART-50 Large (50 languages; personally I find it to be very accurate) and M2M100 (100 languages, but slightly less accurate). Link to GitHub: https://github.com/xhluca/dl-translate
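For reference, a minimal usage sketch based on the dl-translate README (treat the exact arguments as assumptions and check the linked repo for details):
import dl_translate as dlt

mt = dlt.TranslationModel()  # defaults to mBART-50; see the repo for selecting M2M100
print(mt.translate("Bonjour comment ca va?", source="French", target="English"))
For long texts, the README suggests splitting the document into sentences (e.g. with nltk) and passing the list of sentences to mt.translate.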
Upvotes: 3
Reputation: 155
The DeepL API allows you to translate 500k characters every month on the free tier; would this be enough? https://www.deepl.com/en/docs-api/
It might not be, but I wanted to mention it just in case.
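If the quota fits, the official deepl Python package makes this straightforward. A minimal sketch (pip install deepl; the auth key below is a placeholder you must replace with your own):
import deepl

translator = deepl.Translator("YOUR_AUTH_KEY")  # placeholder, use your own API key
result = translator.translate_text("Bonjour comment ca va?", target_lang="EN-US")
print(result.text)  # the translated string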
Upvotes: 1