Reputation: 195
I have a Word document containing different fonts and different languages. One example is a text in English with a corresponding translation in Ancient Greek. For the Ancient Greek part a TrueType font was used (https://fonts2u.com/greek-regular.font). This approach is highly unsuitable for sharing those files, since the text only displays correctly where that font is installed, and I'd like to convert the Ancient Greek part into the corresponding Unicode characters.
I tried the Python package python-docx to import the file. Although I succeeded in importing the file and viewing its content, I couldn't find a way to select only the Ancient Greek characters and convert them to their corresponding Unicode characters.
I was thinking about using the TrueType font's character map to find those characters and replace them with the corresponding Unicode characters. However, when viewing the contents I was unable to select only the Ancient Greek characters.
Q: Is there a way, using VBA, Python, or exporting the files in different encodings, to "translate" the Ancient Greek characters to their corresponding Unicode characters?
Upvotes: 2
Views: 1806
Reputation: 195
Using the python-docx package, I'm searching for and selecting the characters based on their font name:
import docx
doc = docx.Document('greek_text.docx')
doc.paragraphs[3].runs[10].font.name
returns 'Greek' for example
for run in doc.paragraphs[3].runs:
    if run.font.name == "Greek":
        for char in run.text:
            print(char + " " + hex(ord(char)))
g 0x67
u 0x75
n 0x6e
» 0xbb
This prints each character as stored in the file together with the hex value of its code point. What remains is mapping these values to the correct Unicode code points for the Greek characters.
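A minimal sketch of that remaining mapping step, assuming the font-name check above reliably identifies the Greek runs. The table `LEGACY_TO_UNICODE`, the helper names, and the `new_font` parameter are hypothetical; the three entries are illustrative guesses and would have to be replaced by the real mapping read off the font's character map. Note that `run.font.name` is `None` when the font is inherited from a style, so such runs would be skipped here.

```python
# Hypothetical mapping from the legacy font's code points to Unicode Greek.
# These entries are illustrative guesses, NOT taken from the actual font.
LEGACY_TO_UNICODE = {
    "g": "\u03b3",  # GREEK SMALL LETTER GAMMA
    "u": "\u03c5",  # GREEK SMALL LETTER UPSILON
    "n": "\u03bd",  # GREEK SMALL LETTER NU
}


def convert_text(text, mapping=LEGACY_TO_UNICODE):
    """Replace every mapped character; leave unmapped characters unchanged."""
    return text.translate(str.maketrans(mapping))


def convert_document(src_path, dst_path, new_font, font_name="Greek"):
    """Convert all runs set in font_name and switch them to new_font."""
    import docx  # python-docx, imported lazily so convert_text works without it

    doc = docx.Document(src_path)
    for paragraph in doc.paragraphs:
        for run in paragraph.runs:
            if run.font.name == font_name:
                run.text = convert_text(run.text)
                run.font.name = new_font  # a font with real Greek glyphs
    doc.save(dst_path)
```

Characters missing from the table (such as the » above) pass through unchanged, which makes gaps in the mapping easy to spot in the converted file. This sketch only walks body paragraphs; text inside tables, headers, or footnotes would need the same treatment.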
Upvotes: 2
Reputation: 16174
wow, sounds awkward, compounded brokenness!
given that the font is using its own non-standard definition of character encoding, you might be better off using an XML parser to work with the file directly. I'd do this mostly because that's the way that occurs to me as the easiest for selecting the relevant mis-encoded parts of the text.
something like: parse the document XML (e.g. with ElementTree in Python). note that a DOCX file is really a ZIP file containing a file called word/document.xml and other associated images/misc as appropriate.
you'd also want to switch to a font that has Greek characters at their standard Unicode code points. you could either do this in the raw XML, or maybe just reopen the file in Word and set a font everywhere.
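A rough sketch of the XML route, under the assumption that the legacy font is recorded on each run as a `w:rFonts` element with `w:ascii="Greek"` (runs that inherit the font from a style would not be found this way). The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a DOCX (ZIP) file."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def texts_in_font(document_xml, font_name="Greek"):
    """Collect the text of every run whose ASCII font is font_name."""
    root = ET.fromstring(document_xml)
    found = []
    for run in root.iter(f"{{{W}}}r"):
        fonts = run.find("w:rPr/w:rFonts", NS)
        if fonts is not None and fonts.get(f"{{{W}}}ascii") == font_name:
            # A run may hold several w:t elements; join their text.
            found.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return found
```

Calling `texts_in_font(read_document_xml("greek_text.docx"))` would list the mis-encoded strings. Writing converted text back means updating the `w:t` elements and the `w:rFonts` attributes and re-zipping the archive, which this sketch leaves out.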
Upvotes: 1