Reputation: 527
I have already implemented HTML-to-DOCX conversion in Python: I parse the HTML with BeautifulSoup, traverse every tag recursively, and build the DOCX document with the python-docx library.
Now I want to do the reverse and convert a DOCX file to an HTML string. I have read about opening an existing document with python-docx (https://python-docx.readthedocs.io/en/latest/user/documents.html). However, I could not find a way to traverse each object in the document and convert it to HTML.
Is there a way to do this kind of reverse parsing? I have tried https://pypi.org/project/docx2html/ and https://pypi.org/project/mammoth/, but both ignore some styles, and I would prefer to write the code myself rather than rely on a library.
Any help is greatly appreciated.
Upvotes: 5
Views: 2747
Reputation: 702
Here is a solution that converts DOCX to HTML through the Windows COM (OLE) interface of MS Office. It needs Windows, an installed copy of Word, and the pywin32 package:
import win32com.client
import win32com.client.dynamic


class WordSaveFormat:
    # Values from Word's WdSaveFormat enumeration.
    wdFormatNone = None
    wdFormatHTML = 8


class WordOle:
    def __init__(self, filename):
        self.filename = filename
        # Start Word through its COM automation interface and open the file.
        self.word_app = win32com.client.dynamic.Dispatch("Word.Application")
        self.word_doc = self.word_app.Documents.Open(filename)

    def save(self, new_filename=None, word_save_format=WordSaveFormat.wdFormatNone):
        if new_filename:
            self.filename = new_filename
            if word_save_format is None:
                # No explicit format given: let Word choose one for the new file.
                self.word_doc.SaveAs(new_filename)
            else:
                self.word_doc.SaveAs(new_filename, word_save_format)
        else:
            self.word_doc.Save()

    def close(self):
        # Close the document without saving pending changes.
        self.word_doc.Close(SaveChanges=0)
        # Quit Word so no WINWORD.EXE process is left behind.
        # Drop this call if Word should stay open after the conversion.
        self.word_app.Quit()
        del self.word_app

    def show(self):
        self.word_app.Visible = 1

    def hide(self):
        self.word_app.Visible = 0


word_ole = WordOle("D:\\TestDoc.docx")
word_ole.show()
word_ole.save("D:\\TestDoc.html", WordSaveFormat.wdFormatHTML)
# word_ole.save("D:\\TestDoc2.docx", WordSaveFormat.wdFormatNone)
word_ole.close()
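If Word is not available, the hand-written traversal the question asks about could look roughly like the sketch below. It does not use python-docx; instead it reads word/document.xml straight from the DOCX zip archive with the standard library, which is a swapped-in technique rather than anything from the code above. The function names (document_xml_to_html, docx_to_html and the helpers) are my own, and it assumes headings use the style ids Heading1 to Heading6, as in Word's default English template. Only paragraphs, headings, bold and italic are handled; tables, lists, images and inherited style formatting are ignored.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(run_props, tag):
    """True if a toggle such as <w:b/> is present and not switched off."""
    if run_props is None:
        return False
    element = run_props.find(W + tag)
    if element is None:
        return False
    return element.get(W + "val", "true") not in ("0", "false", "off")


def run_to_html(run):
    """Convert one <w:r> run, wrapping its text in <b>/<i> where set."""
    text = "".join(t.text or "" for t in run.iter(W + "t"))
    if not text:
        return ""
    out = html.escape(text)
    props = run.find(W + "rPr")
    if _is_on(props, "i"):
        out = "<i>" + out + "</i>"
    if _is_on(props, "b"):
        out = "<b>" + out + "</b>"
    return out


def paragraph_to_html(paragraph):
    """Convert one <w:p>; style ids Heading1..Heading6 (assumed) become <h1>..<h6>."""
    tag = "p"
    style = paragraph.find(W + "pPr/" + W + "pStyle")
    if style is not None:
        style_id = style.get(W + "val", "")
        level = style_id[len("Heading"):]
        if style_id.startswith("Heading") and level.isdigit() and 1 <= int(level) <= 6:
            tag = "h" + level
    inner = "".join(run_to_html(r) for r in paragraph.iter(W + "r"))
    return "<" + tag + ">" + inner + "</" + tag + ">"


def document_xml_to_html(xml_source):
    """Convert the contents of word/document.xml (str or bytes) to HTML."""
    root = ET.fromstring(xml_source)
    body = root.find(W + "body")
    # iter() also reaches paragraphs nested in tables, so table text is
    # flattened into plain paragraphs in this sketch.
    return "\n".join(paragraph_to_html(p) for p in body.iter(W + "p"))


def docx_to_html(path):
    """Open a .docx file (a zip archive) and convert its main document part."""
    with zipfile.ZipFile(path) as archive:
        return document_xml_to_html(archive.read("word/document.xml"))
```

Calling docx_to_html("D:\\TestDoc.docx") would return the HTML as a string. Extending it means adding one branch per element you care about (for example w:tbl for tables or w:u for underline), the same way the HTML-to-DOCX direction handles one tag at a time.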
Upvotes: 4