Using olefile to extract text from Word .doc

Question

I am only concerned with getting the text from .doc files. I am using Python 3.6 on Windows 10, so textract/antiword are off the table. I looked at other references from this question, but they are all old and incompatible with Windows 10 and/or Python 3.6.

My document is a .doc file with a mix of Chinese and English. I am not familiar with how Word stores its files, and I don't have Word on my machine. Using olefile I was able to get the bytes of the document, but I do not know how to traverse the headers and layout correctly to extract the text. If I naively try

from olefile import OleFileIO as ofio
ole = ofio('d.doc')
stream = ole.openstream('WordDocument')
data = stream.read()
data.decode('utf-16')
# UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 9884-9885: illegal encoding
data[9884:9885]
# b'\xfa'
data[:9884].decode('utf-16')

Then the last line gives me about half the doc, starting and ending with a lot of garbage characters. I suspect I could keep trying this method to get the text piece-by-piece, but I ultimately need to do this for a lot of files. Even if I did it this way, I can't think of a good way to automate it. How can I reliably get the text from a .doc using olefile?

(Feel free to include alternatives to olefile in your answer as well, if you know of one that would work with my specs)

Prof. Falken · Accepted Answer

I am not sure, but I think the problem is that olefile has no understanding of Word documents, only of OLE "streams". So I would guess that the data you extracted contains more than plain text, for example control characters and other binary structures, and that this is why you can't decode it as UTF-16.
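
To illustrate that guess, here is a minimal sketch using synthetic bytes rather than a real WordDocument stream. The header and trailer values are made up for the demonstration; only the idea (UTF-16 text sandwiched between non-text bytes) is meant to mirror what you are seeing.

```python
# Synthetic stand-in for a stream that mixes binary data with UTF-16 text.
# The header and trailer bytes are arbitrary values chosen for this demo.
header = b"\xec\xa5\xc1\x00"                # non-text bytes that still decode (as garbage)
text = "Hello 你好".encode("utf-16-le")      # the part we actually want
trailer = b"\x00\xd8\x01\x00"               # unpaired surrogate: not valid UTF-16

data = header + text + trailer

try:
    data.decode("utf-16-le")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding only the bytes before the failure "works", but the binary
# header comes out as garbage characters in front of the real text.
prefix = data[:error.start].decode("utf-16-le")
print(repr(prefix))
```

Decoding up to the failing position returns the real text with junk in front of it, which matches the "half the doc plus garbage" result from the question. Slicing around each failure does not scale, because nothing in the raw bytes tells you where text ends and binary data begins.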

There are Python modules to convert from doc files, but they tend to work only on Linux where they make use of the command line utilities antiword or catdoc.

I tried other solutions. If the issue is that you have no license for Word but can otherwise install software, LibreOffice could be a path forward. With this command, I converted a Word test file containing Chinese characters from .doc format to HTML:

"c:\Program Files\LibreOffice\program\swriter.exe" --convert-to html d.doc

LibreOffice can also convert to many other formats, but HTML should be simple enough to process further. I also tried a port of catdoc to Windows, but I couldn't get it to handle the Chinese characters.
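
Since you need this for a lot of files, the conversion can be scripted. The following is only a sketch under some assumptions: the helper names (`build_convert_command`, `html_to_text`, `doc_to_text`) are made up for illustration, the `--headless` and `--outdir` switches are additions to the command I actually ran, and I assume LibreOffice writes the HTML as UTF-8. It uses only the standard library and should run on Python 3.6.

```python
import subprocess
from html.parser import HTMLParser
from pathlib import Path

# Path taken from the command above; adjust it to your installation.
SWRITER = r"c:\Program Files\LibreOffice\program\swriter.exe"


def build_convert_command(doc_path, out_dir, swriter=SWRITER):
    """Return the LibreOffice command line that converts doc_path to HTML."""
    # --headless (no GUI) and --outdir are additions to the original command.
    return [swriter, "--headless", "--convert-to", "html",
            "--outdir", str(out_dir), str(doc_path)]


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self.parts.append("\n")  # one line per paragraph

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags from an HTML string and return the non-empty text lines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def doc_to_text(doc_path, out_dir="converted"):
    """Convert one .doc with LibreOffice and return its plain text."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    html_file = Path(out_dir) / (Path(doc_path).stem + ".html")
    # Assumption: the converted HTML is encoded as UTF-8.
    return html_to_text(html_file.read_text(encoding="utf-8"))
```

To process a whole folder you could loop over `Path("docs").glob("*.doc")` and call `doc_to_text` on each path. The tag stripping here is deliberately crude; if the converted HTML turns out to be messier than my test file, a proper HTML library may be the better choice.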

Too bad you don't have Word installed, or you could have made it do the work for you. Leaving that solution here in case someone else has use for it:

import win32com.client

app = win32com.client.Dispatch("Word.Application")
doc = None

try:
    app.Visible = False
    # Documents.Open returns the opened Document object
    doc = app.Documents.Open('c:/temp/d.doc')

    with open('out.txt', 'w', encoding='utf-16') as f:
        f.write(doc.Content.Text)

except Exception as e:
    print(e)

finally:
    # Close the document without saving, then shut Word down
    if doc is not None:
        doc.Close(False)
    app.Quit()
