Python: Extract MS Word data in a tree structure

Question

Is there any way I can extract MS Word file data in a tree structure? What I mean is that the document has headings, paragraphs, and tables, and I'd like to extract that content following the hierarchy of the headings. I'm not sure what the best approach is. Can anyone share their experience of parsing a Word document with Python?

scanny · Accepted Answer

Headings, or "section headings" in printing parlance, are not container objects in Word; they are each simply a paragraph object with formatting that causes them to appear as a section heading, often a bold and somewhat bigger font than the body text.

So whatever approach you take, there is some risk of missing a "boundary" that a human reader would perceive, because the file itself does not record where a section begins or ends.

The best approach depends a bit on the documents you'll be working with. In the best case, each section starts with a paragraph having one of the Heading {n} styles, like "Heading 1" and "Heading 2". Then you can just proceed through the paragraphs, checking each for one of those styles, and populate your hierarchy accordingly. An author has good reason to stick to this discipline, since it makes generating a table of contents (TOC) much easier.
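A minimal sketch of that best case, assuming the python-docx library and documents that use the built-in English style names ("Heading 1", "Heading 2", ...). The names `heading_level`, `build_tree`, and `tree_from_docx` are made up for illustration, and the nested-dict layout is just one possible shape for the hierarchy. It only walks paragraphs; tables would need separate handling.

```python
import re

# Matches the built-in English heading style names, e.g. "Heading 2".
HEADING_RE = re.compile(r"^Heading (\d+)$")


def heading_level(style_name):
    """Return the level for a style name like "Heading 2", else None."""
    match = HEADING_RE.match(style_name or "")
    return int(match.group(1)) if match else None


def build_tree(items):
    """Build a nested tree from (style_name, text) pairs in document order."""
    root = {"title": None, "level": 0, "content": [], "children": []}
    stack = [root]  # path from the root to the currently open section
    for style_name, text in items:
        level = heading_level(style_name)
        if level is None:
            # Body paragraph: attach it to the innermost open section.
            stack[-1]["content"].append(text)
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"title": text, "level": level, "content": [], "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def tree_from_docx(path):
    """Read a .docx file with python-docx and build the heading tree."""
    from docx import Document  # third-party: pip install python-docx

    document = Document(path)
    return build_tree((p.style.name, p.text) for p in document.paragraphs)
```

Paragraphs that appear before the first heading end up in the root node's "content" list. A heading that skips a level (say "Heading 3" directly under "Heading 1") is simply attached to the nearest shallower heading.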

Otherwise you'll need to look for other reliable markers indicating the start of a new section.
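As one example of such a marker, here is a rough heuristic sketch that treats a short paragraph whose runs are all explicitly bold as a heading. This is only an assumption about how a document might be formatted, not a general rule; `looks_like_heading`, `bold_flags`, and the 80-character cutoff are invented for illustration. In python-docx, `run.bold` is `None` when boldness is inherited from a style, so this check would miss headings that are bold only through their style.

```python
def looks_like_heading(text, run_bold_flags, max_chars=80):
    """Guess whether a paragraph is a heading: short and entirely bold."""
    stripped = text.strip()
    if not stripped or len(stripped) > max_chars:
        return False
    # Require at least one run, and every run explicitly set to bold.
    return bool(run_bold_flags) and all(flag is True for flag in run_bold_flags)


def bold_flags(paragraph):
    """Collect run.bold for each non-blank run of a python-docx paragraph."""
    return [run.bold for run in paragraph.runs if run.text.strip()]
```

With python-docx this could be called as `looks_like_heading(p.text, bold_flags(p))` for each paragraph `p`; expect to tune or replace the rule for your own documents.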

Note that Word also has a concept of "section" which is quite different from how I'm using the word here. In Word, a section is a contiguous block of pages that share the same page format (like margins, portrait/landscape, etc.). In publishing parlance, a section is a subdivision of a chapter or similar block that has a heading (but generally not a page break) and may itself be divided into sub-sections, each level with a smaller heading.
