Reputation: 1254
For a DOCX document I do:
import zipfile
from bs4 import BeautifulSoup
document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')
How can I do this for a DOC document?
Upvotes: 2
Views: 1901
Reputation: 111611
DOCX files are tough enough to process, even though they're XML-based and documented by international standards organizations. DOC files are binary and proprietary.
Don't try to process DOC files directly. Convert them to DOCX first.
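One way to do the conversion from Python is to shell out to LibreOffice in headless mode. This is only a rough sketch, and it assumes LibreOffice is installed and that its `soffice` executable is on your PATH. The helper names (`build_convert_command`, `convert_doc_to_docx`, `read_document_xml`) are made up for illustration; they are not part of any library.

```python
import shutil
import subprocess
import zipfile
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to DOCX."""
    return [
        soffice,
        "--headless",             # run without opening a window
        "--convert-to", "docx",   # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Convert a .doc file to .docx and return the path of the new file."""
    # Assumption: the LibreOffice executable is reachable as "soffice".
    if shutil.which("soffice") is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    docx_path = out_dir / (Path(doc_path).stem + ".docx")
    if not docx_path.exists():
        raise RuntimeError(f"Conversion did not produce {docx_path}")
    return docx_path


def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml, as in the question."""
    with zipfile.ZipFile(docx_path) as document:
        return document.read("word/document.xml")
```

After that, the code from the question should work unchanged on the converted file, for example `soup = BeautifulSoup(read_document_xml(convert_doc_to_docx(path, 'converted')), 'html.parser')`, where `'converted'` is just an example output directory.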
See:
Upvotes: 3