Reputation: 393
I am only concerned with getting the text from .doc files. I am using Python 3.6 on Windows 10, so textract/antiword are off the table. I looked at other references from this question, but they are all old and incompatible with Windows 10 and/or Python 3.6.
My document is a .doc file with a mix of Chinese and English. I am not familiar with how Word stores its files, and I don't have Word on my machine. Using olefile I was able to get the bytes of the document, but I do not know how to traverse the headers and layout correctly to extract the text. If I naively try
from olefile import OleFileIO as ofio
ole = ofio('d.doc')
stream = ole.openstream('WordDocument')
data = stream.read()
data.decode('utf-16')
>>>UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 9884-9885: illegal encoding
data[9884:9885]
>>>b'\xfa'
data[:9884].decode('utf-16')
Then the last line gives me about half the doc, starting and ending with a lot of garbage characters. I suspect I could keep trying this method to get the text piece-by-piece, but I ultimately need to do this for a lot of files. Even if I did it this way, I can't think of a good way to automate it. How can I reliably get the text from a .doc using olefile?
(Feel free to include alternatives to olefile in your answer as well, if you know of one that would work with my specs)
Upvotes: 2
Views: 3240
Reputation: 24937
I am not sure, but I think the problem is that olefile has no understanding of Word documents, only of OLE "streams". The data you extracted is therefore more than plain text: the WordDocument stream also holds binary header and formatting structures, which I guess is why you can't decode it as UTF-16 in one go.
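To make that concrete, here is a rough sketch of how the text could be located by hand, based on my reading of the Word 97-2003 binary layout rather than on anything olefile provides. It assumes an unencrypted .doc in which the header stores the position of the "piece table" at offset 0x01A2, and in which each piece is either UTF-16-LE or single-byte text (treated here as cp1252). The function names are my own, and the result contains everything the piece table references (paragraph marks as \r, field codes, possibly footnotes and headers), so expect to clean it up.

```python
import struct

FC_CLX_OFFSET = 0x01A2  # assumed position of fcClx/lcbClx in the Word 97-2003 header


def table_stream_name(word_stream):
    """Pick '0Table' or '1Table' from bit 9 of the flag word at offset 0x0A."""
    flags = struct.unpack_from("<H", word_stream, 0x0A)[0]
    return "1Table" if flags & 0x0200 else "0Table"


def extract_text(word_stream, table_stream):
    """Walk the piece table and decode each piece of text it points to."""
    fc_clx, lcb_clx = struct.unpack_from("<II", word_stream, FC_CLX_OFFSET)
    clx = table_stream[fc_clx:fc_clx + lcb_clx]

    # Skip optional property blocks: marker byte 0x01, then a 2-byte size.
    pos = 0
    while pos < len(clx) and clx[pos] == 0x01:
        size = struct.unpack_from("<H", clx, pos + 1)[0]
        pos += 3 + size
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError("piece table not found")

    plc_len = struct.unpack_from("<I", clx, pos + 1)[0]
    plc = clx[pos + 5:pos + 5 + plc_len]
    n = (plc_len - 4) // 12  # n+1 character positions (4 bytes), n descriptors (8 bytes)
    cps = struct.unpack_from("<%dI" % (n + 1), plc, 0)

    parts = []
    for i in range(n):
        descriptor = 4 * (n + 1) + 8 * i
        fc_raw = struct.unpack_from("<I", plc, descriptor + 2)[0]
        count = cps[i + 1] - cps[i]  # number of characters in this piece
        fc = fc_raw & 0x3FFFFFFF
        if fc_raw & 0x40000000:
            # "Compressed" piece: one byte per character, stored at fc / 2.
            start = fc // 2
            raw = word_stream[start:start + count]
            parts.append(raw.decode("cp1252", errors="replace"))
        else:
            # Uncompressed piece: UTF-16-LE, two bytes per character.
            raw = word_stream[fc:fc + 2 * count]
            parts.append(raw.decode("utf-16-le", errors="replace"))
    return "".join(parts)


def doc_to_text(path):
    """Read both streams with olefile and return the decoded text."""
    import olefile  # third-party: pip install olefile

    ole = olefile.OleFileIO(path)
    try:
        word = ole.openstream("WordDocument").read()
        table = ole.openstream(table_stream_name(word)).read()
    finally:
        ole.close()
    return extract_text(word, table)
```

Calling doc_to_text('d.doc') would then return a single string. I have not run this against a wide range of real files, so treat it as a starting point and compare its output with a known-good conversion before relying on it.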
There are Python modules to convert from doc files, but they tend to work only on Linux, where they make use of the command-line utilities antiword or catdoc.
I tried some other solutions. If the issue is that you have no license for Word but can otherwise install software, LibreOffice could be a path forward. With this command, I converted a Word test file containing Chinese characters from doc format to HTML:
"c:\Program Files\LibreOffice\program\swriter.exe" --convert-to html d.doc
LibreOffice can also convert to many other formats, but HTML should be simple enough to process further. I also tried a Windows port of catdoc, but I couldn't get it to handle the Chinese characters.
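Since you need to do this for a lot of files, the LibreOffice conversion could be scripted along these lines. This is only a sketch: the default path is the one from my command above, the function names are made up for illustration, and I have added the --headless and --outdir options (which LibreOffice supports) so that no window opens and the results land in a folder of your choice.

```python
import subprocess
from pathlib import Path

# Assumed install location; adjust to wherever LibreOffice lives on your machine.
SOFFICE = r"c:\Program Files\LibreOffice\program\swriter.exe"


def build_convert_command(doc_path, out_dir, fmt="html", soffice=SOFFICE):
    """Return the argument list for converting one document."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", fmt,     # target format, e.g. "html"
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_folder(in_dir, out_dir, fmt="html"):
    """Convert every .doc file in in_dir, one LibreOffice call per file."""
    for doc_path in sorted(Path(in_dir).glob("*.doc")):
        subprocess.run(build_convert_command(doc_path, out_dir, fmt), check=True)
```

For example, convert_folder('c:/temp/docs', 'c:/temp/html') would write one HTML file per document. Starting LibreOffice once per file is slow but keeps the loop simple, and check=True makes a failed conversion raise instead of passing silently.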
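import win32com.client

# Alternative for machines that do have Word installed: drive Word
# through COM (needs the pywin32 package) and let it produce the text.
app = win32com.client.Dispatch("Word.Application")
try:
    app.Visible = False
    # Documents.Open returns the opened document, so use it directly.
    doc = app.Documents.Open('c:/temp/d.doc')
    with open('out.txt', 'w', encoding='utf-16') as f:
        f.write(doc.Content.Text)
except Exception as e:
    print(e)
finally:
    # Always shut Word down, even if opening or writing failed.
    app.Quit()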
Upvotes: 1