Petar

Reputation: 173

Python translate large texts to English

I am searching for a Python library that translates very large texts to English. I have already tried TextBlob (which at some point just stops translating, API limits I suppose) and googletrans (which also stops translating at some point; it cannot handle very large texts either, so I have to split them into pieces and then merge the results). I am looking for a solution I can be sure won't stop working, since I will be running this code regularly on around 100K texts with an average length of 10K words each. If anyone has done something similar, I would appreciate your help!
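For illustration, a minimal sketch of the split-and-merge workaround described above (the 4,000-character chunk size and the use of googletrans are assumptions, not a setup that is guaranteed to keep working):

from googletrans import Translator

def translate_in_chunks(text, chunk_size=4000):
    """Naively split a long text into chunks, translate each, and merge the results."""
    translator = Translator()
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # splitting on sentence boundaries instead of fixed offsets would preserve context better
    translated = [translator.translate(chunk, dest="en").text for chunk in chunks]
    return " ".join(translated)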

Upvotes: 4

Views: 4577

Answers (3)

Franck Dernoncourt

Reputation: 83157

One can use mBART-50 (paper, pre-trained model on Hugging Face). The pre-trained model on Hugging Face is released under the MIT license.


See below for example code. The environment I used on Ubuntu 20.04.5 LTS with an NVIDIA A100 40GB GPU and CUDA 12.0:

conda create --name mbart-python39 python=3.9
conda activate mbart-python39 
pip install transformers==4.28.1
pip install chardet==5.1.0
pip install sentencepiece==0.1.99
pip install protobuf==3.20

Code (>95% of the code below is from the Hugging Face documentation):

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_fr = "Bonjour comment ca va?"
article_en = "What's going on?"

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate French to English
tokenizer.src_lang = "fr_XX"
encoded_fr = tokenizer(article_fr, return_tensors="pt")
generated_tokens = model.generate(**encoded_fr, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)
# => "Hello, how are you?"

# translate English to French
tokenizer.src_lang = "en_XX"
encoded_en = tokenizer(article_en, return_tensors="pt")
generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
translation = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(translation)
# => "Qu'est-ce qui se passe?"

Upvotes: 1

Sggg

Reputation: 31

The Python library dl-translate gets the job done very well. It is based on Hugging Face Transformers, with two available model options: mBART-50 Large (50 languages; personally I find it very accurate) and M2M100 (100 languages, but slightly less accurate). Link to GitHub: https://github.com/xhluca/dl-translate
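A short usage sketch based on the project's README (the language constants and the default model choice are taken from the README and may differ between versions):

import dl_translate as dlt

mt = dlt.TranslationModel()  # defaults to mBART-50 Large
print(mt.translate("Bonjour comment ca va?", source=dlt.lang.FRENCH, target=dlt.lang.ENGLISH))

# for long documents, the README suggests splitting into sentences (e.g. with nltk)
# and passing the list so it is translated in batches:
# sentences = nltk.tokenize.sent_tokenize(long_text, "french")
# " ".join(mt.translate(sentences, source=dlt.lang.FRENCH, target=dlt.lang.ENGLISH, batch_size=32))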

Upvotes: 3

Ren

Reputation: 155

The DeepL API free tier allows you to translate 500k characters every month; would this be enough? https://www.deepl.com/en/docs-api/

It might not be, but I wanted to be sure.
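If the quota works out, DeepL also ships an official Python client (pip install deepl); a minimal sketch, with the auth key as a placeholder:

import deepl

translator = deepl.Translator("YOUR_AUTH_KEY")  # placeholder: use the key from your DeepL account
result = translator.translate_text("Bonjour comment ca va?", target_lang="EN-US")
print(result.text)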

Upvotes: 1
