How to get XML from DOC (not DOCX)?

Question

For a DOCX document I do:

import zipfile
from bs4 import BeautifulSoup
document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')

How do I do this for a DOC document?

kjhughes · Accepted Answer

You don't.

DOCX files are tough enough to process, even though they're XML-based and documented by international standards organizations (ECMA-376, ISO/IEC 29500). DOC files are binary and proprietary.

Don't try to process DOC files directly. Convert them to DOCX first.
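One way to do the conversion from Python is to shell out to LibreOffice in headless mode. The following is only a sketch, not part of the original answer: it assumes LibreOffice is installed and that its `soffice` (or `libreoffice`) executable is on PATH. The helper names `build_convert_command` and `convert_doc_to_docx` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts doc_path to DOCX."""
    return [
        soffice,
        "--headless",             # run without opening a GUI window
        "--convert-to", "docx",   # target format
        "--outdir", str(out_dir), # directory that receives the .docx
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir):
    """Convert a .doc file to .docx and return the path of the new file.

    Assumption: LibreOffice is installed; this raises if it cannot be found.
    """
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice is None:
        raise RuntimeError("LibreOffice (soffice) was not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return out_dir / (Path(doc_path).stem + ".docx")
```

The returned path can then be opened with `zipfile.ZipFile(...)` and `word/document.xml` read exactly as in the question. Conversion fidelity depends on LibreOffice, so the resulting XML may not match what Word itself would produce.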
