Gaurav Bagul

Reputation: 527

Convert DOCX to HTML programmatically using Python

I have already implemented HTML-to-DOCX conversion in Python: I parse the HTML with BeautifulSoup, traverse every tag recursively, and build the DOCX document with the python-docx library.

Now I want to do the reverse and convert a DOCX file to an HTML string. I have read about opening an existing document with python-docx (https://python-docx.readthedocs.io/en/latest/user/documents.html). However, I could not find a way to traverse each object in the document and convert it to HTML.

Is there a way to do this reverse parsing? I have tried the libraries https://pypi.org/project/docx2html/ and https://pypi.org/project/mammoth/. However, they ignore some styles, and I would prefer to write the code myself rather than rely on a library.

Any help is greatly appreciated.

Upvotes: 5

Views: 2747

Answers (1)

Rufat

Reputation: 702

Here is a solution for converting DOCX to HTML through the Windows COM (OLE) interface of MS Office. It requires Windows with Microsoft Word installed, plus the pywin32 package:

import win32com.client
import win32com.client.dynamic


class WordSaveFormat:
    wdFormatNone = None
    wdFormatHTML = 8


class WordOle:
    def __init__(self, filename):
        self.filename = filename
        self.word_app = win32com.client.dynamic.Dispatch("Word.Application")
        self.word_doc = self.word_app.Documents.Open(filename)

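    def save(self, new_filename=None, word_save_format=WordSaveFormat.wdFormatNone):
        if new_filename:
            self.filename = new_filename
            if word_save_format is None:
                # No explicit format: let Word keep the document's current format.
                self.word_doc.SaveAs(new_filename)
            else:
                self.word_doc.SaveAs(new_filename, word_save_format)
        else:
            self.word_doc.Save()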

    def close(self):
        # Close the document without saving, then quit Word so that no
        # WINWORD.EXE process is left running in the background.
        self.word_doc.Close(SaveChanges=0)
        self.word_app.Quit()
        del self.word_app

    def show(self):
        self.word_app.Visible = 1

    def hide(self):
        self.word_app.Visible = 0


word_ole = WordOle("D:\\TestDoc.docx")
word_ole.show()
word_ole.save("D:\\TestDoc.html", WordSaveFormat.wdFormatHTML)
# word_ole.save( "D:\\TestDoc2.docx", WordSaveFormat.wdFormatNone )
word_ole.close()
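
If Word is not available, or you want to write the traversal yourself as the question asks, one possible starting point is to read the DOCX as a ZIP archive and walk word/document.xml with the standard library. The following is only a minimal sketch under several assumptions: it handles body paragraphs and the bold/italic run properties, and ignores styles, lists, tables, images and everything else. The names docx_to_html, run_to_html and _is_on are made up for this illustration and do not come from any library.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(run_props, tag):
    """Return True if a toggle property such as w:b or w:i is enabled."""
    if run_props is None:
        return False
    elem = run_props.find(W + tag)
    if elem is None:
        return False
    # <w:b/> means on; an explicit w:val of "0", "false" or "off" means off.
    return elem.get(W + "val", "true") not in ("0", "false", "off")


def run_to_html(run):
    """Convert one w:r (run) element to an HTML fragment (hypothetical helper)."""
    text = "".join(t.text or "" for t in run.iter(W + "t"))
    out = html.escape(text)
    props = run.find(W + "rPr")
    if _is_on(props, "b"):
        out = "<strong>" + out + "</strong>"
    if _is_on(props, "i"):
        out = "<em>" + out + "</em>"
    return out


def docx_to_html(path):
    """Return a simple HTML string for the paragraphs of a .docx file.

    `path` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(W + "body")
    parts = []
    # iter() also visits paragraphs nested in tables; they are simply flattened.
    for para in body.iter(W + "p"):
        inner = "".join(run_to_html(r) for r in para.iter(W + "r"))
        parts.append("<p>" + inner + "</p>")
    return "\n".join(parts)
```

Calling docx_to_html("D:\\TestDoc.docx") would return one <p> element per paragraph. Supporting more styles would mean mapping further elements, such as w:pStyle for headings or w:tbl for tables, in the same way.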

Upvotes: 4
