sandboxj

Reputation: 1254

How to get XML from DOC (not DOCX)?

For a DOCX document I do:

import zipfile
from bs4 import BeautifulSoup

document = zipfile.ZipFile(path)
soup = BeautifulSoup(document.read('word/document.xml'), 'html.parser')

How can I do this for a DOC document?

Upvotes: 2

Views: 1901

Answers (1)

kjhughes

Reputation: 111611

You don't.

DOCX files are tough enough to process, and they at least are XML-based and documented by international standards organizations. DOC files are binary and proprietary.

Don't try to process DOC files directly. Convert them to DOCX first.
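
One way to do the conversion, as a rough sketch rather than the only option: drive LibreOffice in headless mode from Python, then read the resulting DOCX exactly as in the question. This assumes LibreOffice is installed and its `soffice` executable is on the PATH; the function names `build_convert_command` and `doc_to_document_xml` are made up for illustration.

```python
import subprocess
import zipfile
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to DOCX."""
    return [
        soffice,            # assumption: LibreOffice's executable is on PATH
        "--headless",       # run without opening a GUI
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def doc_to_document_xml(doc_path, out_dir, soffice="soffice"):
    """Convert a .doc file to .docx, then return word/document.xml as bytes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    # LibreOffice writes <original file stem>.docx into out_dir.
    docx_path = out_dir / (Path(doc_path).stem + ".docx")
    with zipfile.ZipFile(docx_path) as document:
        return document.read("word/document.xml")
```

The bytes returned by `doc_to_document_xml` can be passed to `BeautifulSoup` the same way as `document.read('word/document.xml')` in the question.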

Upvotes: 3
