Bearded Guy

Reputation: 1

Translate multiple PDFs from Russian to English

I have a set of PDF documents in Russian that I need to translate to English, and I need to automate this.

Currently I upload each document to Google Translate to get it translated, but this takes a lot of time and is not scalable.

Upvotes: -2

Views: 1169

Answers (2)

Adarsh Patil

Reputation: 1

You can automate this with Python.

  1. Import the required modules: PyMuPDF (imported as `fitz`) and `GoogleTranslator` from `deep_translator`.

  2. Use code along these lines (the language pair here is set to Russian to English to match the question):

    import fitz  # PyMuPDF
    from deep_translator import GoogleTranslator

    WHITE = fitz.utils.getColor("white")
    textflags = fitz.TEXT_DEHYPHENATE  # rejoin words hyphenated across line breaks
    to_en = GoogleTranslator(source="ru", target="en")

    # Open the PDF
    doc = fitz.open(r'C:\projects\Trunk\translator\0A1.pdf')

    # Create an Optional Content Group (OCG) for the translated text
    ocg = doc.add_ocg('English Translation', on=True)

    for page in doc:
        blocks = page.get_text("blocks", flags=textflags)

        for block in blocks:
            bbox = block[:4]  # text position (x0, y0, x1, y1)
            text = block[4]   # extracted text
            translated_text = to_en.translate(text)  # translate to English

            # Hide the original text by overlaying a white rectangle
            page.draw_rect(bbox, color=None, fill=WHITE, oc=ocg)

            # Insert the translated text at the same position
            page.insert_textbox(bbox, translated_text, fontname="helv",
                                fontsize=10, color=(0, 0, 0), oc=ocg)

    # Ensure fonts are properly embedded
    doc.subset_fonts()

    # Save the translated PDF
    doc.save(r'C:\projects\Trunk\translator\translated_english.pdf')
    print("Translated PDF saved successfully!")

Running this saves a translated copy of your PDF.
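
The question asks about a set of PDFs, while the script above handles a single file. Here is a minimal sketch of a batch wrapper, assuming the per-file logic has been wrapped in a function. The names `translate_folder` and `translate_pdf`, and the `_en` output suffix, are hypothetical and not part of PyMuPDF or deep_translator:

```python
from pathlib import Path
from typing import Callable, List


def translate_folder(
    src_dir: Path,
    dst_dir: Path,
    translate_pdf: Callable[[Path, Path], None],
) -> List[Path]:
    """Run translate_pdf(src, dst) on every *.pdf in src_dir.

    translate_pdf is a hypothetical callable wrapping the single-file
    script. Returns the list of source files that failed.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(src_dir.glob("*.pdf")):
        dst = dst_dir / f"{src.stem}_en.pdf"
        if dst.exists():
            continue  # already translated on an earlier run
        try:
            translate_pdf(src, dst)
        except Exception as exc:
            # Keep going: one bad file should not stop the whole batch
            print(f"Failed on {src.name}: {exc}")
            failed.append(src)
    return failed
```

Skipping files whose output already exists means the batch can be re-run after a failure (for example a network error from the translation service) without repeating finished work.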

Upvotes: -1

bananabrann

Reputation: 553

(Note: I am unfamiliar with translating documents, but this should point your basic architecture in the right direction.)

Based on our brief exchange, I would recommend exploring a process like this:


You would host the docs in a SharePoint List; when a doc is added, a Power Automate flow triggers, translates the doc, and writes the result back. You could either use Microsoft's in-house extraction and translation software (or Power Automate steps/actions), or you could send an HTTP request to whatever service you would like. A Google search for translation or text extraction APIs reveals several options, including Google Translate.

If you don't have any requirement to use Google Translate (or something else), I would personally stick with the same brand of tech so that there's less headache from working with an outside service, but of course that depends on your requirements. You can initiate HTTP requests with the "HTTP" action.

Within Power Automate, you would use the "When an item is created" SharePoint trigger, then Encodian's "Extract Text from Image" action (or a different one, depending on your file type).

Then, simply take the output and pass it to Microsoft Translator, or send it in an HTTP request to whichever service you prefer.
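
To give an idea of what such an HTTP request could contain, here is a rough Python sketch of a call to the Microsoft Translator Text REST API (version 3.0). The endpoint, query parameters, and header names follow that API as I understand it; the function names are hypothetical, and the key and region are placeholders for your own Azure resource. The same URL, headers, and JSON body could be entered in the Power Automate "HTTP" action:

```python
import json
import urllib.request

# Global endpoint of the Translator Text API v3 (assumed; regional endpoints also exist)
ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(text, key, region, source="ru", target="en"):
    """Build (but do not send) a POST request that translates one text."""
    url = f"{ENDPOINT}?api-version=3.0&from={source}&to={target}"
    # The API expects a JSON array of objects, each with a "Text" field
    body = json.dumps([{"Text": text}]).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,        # placeholder: your resource key
        "Ocp-Apim-Subscription-Region": region,  # placeholder: your resource region
        "Content-Type": "application/json; charset=UTF-8",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_translation(response_body):
    """Pull the first translated string out of the JSON response."""
    return json.loads(response_body)[0]["translations"][0]["text"]
```

To actually send the request, pass it to `urllib.request.urlopen` and feed the response body to `parse_translation`.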

You can then write the translated output wherever you would like: another SharePoint List, a database, an email, and so on.

Upvotes: -1
