Converting a word document containing mixed fonts to unicode

Question

I have a Word document containing different fonts and different languages. One example is a text in English with a corresponding translation in Ancient Greek. For the Ancient Greek part a TrueType font was used (https://fonts2u.com/greek-regular.font). This approach is highly unsuitable for sharing those files, and I'd like to convert the Ancient Greek part into the corresponding Unicode characters.

I tried the Python package python-docx to import the file. Although I was successful at importing and viewing the file content, I couldn't find a way to select only the Ancient Greek characters and convert them to their corresponding Unicode characters.

I was thinking about using the TrueType font's character map to find those characters and replace them with the corresponding Unicode characters. However, when viewing the contents I was unable to select only the Ancient Greek characters.

Q: Is there a way, using VBA, Python, or exporting the files in different encodings, to "translate" the Ancient Greek characters to their corresponding Unicode characters?

Sam Mason · Accepted Answer

wow, sounds awkward, compounded brokenness!

given that the font is using its own non-standard character encoding, you might be better off using an XML parser to work with the file directly. I'd do this mostly because it seems to me the easiest way to select the relevant mis-encoded parts of the text.

something like:

1. open the file (ElementTree in Python). note that a DOCX file is really a ZIP file containing a file called word/document.xml and other associated images/misc as appropriate
2. use an xpath selector to get all instances of text that use the greek font
3. use your remapping code to move from the broken greek encoding to use real Unicode characters
4. save the file
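
a minimal sketch of those steps, using only the standard library. the font name "Greek" and the three-entry mapping table are placeholders I made up: the real name is whatever appears in the w:rFonts attributes of your document.xml, and the real table has to come from the font's character map. it also only looks at fonts set directly on a run, so text that gets the font through a paragraph or character style would be missed.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Assumptions: the font name as stored in the document, and a tiny
# illustrative mapping. Build the real table from the font's character map.
GREEK_FONT = "Greek"
GREEK_MAP = str.maketrans({"a": "α", "b": "β", "g": "γ"})


def run_uses_font(run, font_name):
    """Return True if the run's direct formatting names font_name."""
    fonts = run.find("w:rPr/w:rFonts", NS)
    if fonts is None:
        return False
    return font_name in fonts.attrib.values()


def remap_document_xml(xml_bytes, font_name=GREEK_FONT, table=GREEK_MAP):
    """Translate the text of every run that uses font_name."""
    # Register the prefixes found in the source so ElementTree writes
    # w:, r:, ... instead of ns0:, ns1:, ...
    for _, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",)):
        ET.register_namespace(prefix, uri)
    root = ET.fromstring(xml_bytes)
    for run in root.iter(f"{{{W_NS}}}r"):
        if run_uses_font(run, font_name):
            for t in run.findall("w:t", NS):
                if t.text:
                    t.text = t.text.translate(table)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def convert_docx(src, dst):
    """Copy a DOCX, rewriting only word/document.xml."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                data = remap_document_xml(data)
            zout.writestr(item, data)
```

you'd call it as convert_docx("input.docx", "output.docx") (both file names are made up). one caveat: ElementTree only writes the namespace declarations it sees in use, so if Word refuses to open the result, a parser that keeps the original declarations, such as lxml, may be the safer choice.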

you'd also want to switch to a font that has greek characters at their standard Unicode code points. you could either do this in the raw XML, or maybe just reopen the file in Word and set a font everywhere
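
if you go the raw XML route, a rough sketch of the font swap could look like this. the function name and both font names in the usage line are placeholders, and it only touches w:rFonts elements inside the XML you pass in, so fonts defined in word/styles.xml would need the same treatment.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def replace_font(xml_bytes, old_font, new_font):
    """Point every w:rFonts attribute that names old_font at new_font."""
    root = ET.fromstring(xml_bytes)
    for fonts in root.iter(f"{{{W_NS}}}rFonts"):
        # Copy the items first because the attributes are modified in the loop.
        for attr, value in list(fonts.attrib.items()):
            if value == old_font:
                fonts.set(attr, new_font)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

for example replace_font(xml_bytes, "Greek", "Times New Roman"), with both names standing in for your own. do the character remapping before this step, since selecting the mis-encoded text relies on the old font name still being there.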
