Sven Tenscher
Sven Tenscher

Reputation: 195

Converting a word document containing mixed fonts to unicode

I've a word document containing different fonts and different languages. One instance would be a text in english and a corresponding translation in ancient greek. For the ancient greek part a TrueType Font was used (https://fonts2u.com/greek-regular.font). Now this approach is highly unsuitable for sharing those files and I'd like to convert the ancient greek part into corresponding unicode characters.

I tried the python package python-docx to import the file. Although successfull at importing and viewing the file content, I couldn't find a way to select only the ancient greek characters and convert them to their corresponding unicode characters.

I was thinking about using the TrueType Font character map and find and replace those characters with the corresponding unicode characters. However viewing the contents I was unable to select only ancient greek characters.

Q: Is there a way using VBA, python or exporting the files in different encodings to "translate" the ancient greek characters to their corresponding unicode characters?

Upvotes: 2

Views: 1806

Answers (2)

Sven Tenscher
Sven Tenscher

Reputation: 195

using the python-docx package I'm searching and selecting the characters based on their font name

import docx
doc = docx.Document('greek_text.docx')
doc.paragraphs[3].runs[10].font.name

returns 'Greek' for example

for run in doc.paragraphs[3].runs:
    if run.font.name == "Greek":
        for char in run.text:
            print (char +" "+ str(hex(ord(char))))

g 0x67

u 0x75

n 0x6e

» 0xbb

returns the unicode character and corresponding hex value. That leaves the mapping of these values to the correct unicode values for their greek characters.

Upvotes: 2

Sam Mason
Sam Mason

Reputation: 16174

wow, sounds awkward, compounded brokenness!

given that the font is using its own non-standard definition of character encoding, you might be easier off using an XML parser to work with the file directly. I'd do this mostly because that's the way that occurs to me to be easiest to select the relevant mis-encoded parts of the text.

something like:

  1. open the file (ElementTree in Python). note that a DOCX file is really a ZIP file containing a file called word/document.xml and other associated images/misc as appropriate
  2. use an xpath selector to get all instances of text that use the greek font
  3. use your remapping code to move from the broken greek encoding to use real Unicode characters
  4. save the file

you'd want to switch to a font that uses greek characters in their standard unicode code points, you could either do this in the raw XML, or maybe just reopen the file in Word and set a font everywhere

Upvotes: 1

Related Questions