TheDavil

Reputation: 706

Microsoft Word automation - deleting headings (and their sub-info) - programmatically

Just wondering if anybody has experience with reading in Microsoft Word documents and deleting certain paragraphs and blocks programmatically (based on headings)?

Does anybody know of any libraries that could do this in one of the languages I'm comfortable in :

I've googled a few and most seem to be able to read and write documents (and their parts), but iterating over a list of the current headings doesn't seem to be covered. If I could get that list as an object (or something like that), then I could remove specifically what I want.

The main purpose for this is that I have a large template doc with lots of information, but only certain parts are required for each document. I want to pick and choose those parts, so I intend to build a small frontend to generate these docs on the fly.

How I would achieve this in MS Word

As you'll see in the image above, deleting the "Mutts" heading 2 item will delete everything within the red box. If this were possible using any pre-written libraries, that would be amazing and I wouldn't have to dig into the XML.

I'd also prefer not to have to use the COM (Component Object Model) if at all possible, but if it comes to that I'll probably use the Python for Windows Extensions.

Any help that you guys could provide is very much appreciated.

Upvotes: 3

Views: 151

Answers (1)

Jared Goguen

Reputation: 9008

I'm posting this as an answer because there's too much information for a comment. With that in mind, this won't really answer your question. For a Word document that looks like:


Heading 1

Stuff

Stuff

Stuff

Heading 2

Other stuff

Other stuff

Other stuff


The resulting xml, stripped of attributes and unnecessary elements, looks something like:

<?xml version="1.0" encoding="UTF-8"?>
<w:document>
    <w:body>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Heading1"/>
            </w:pPr>
            <w:r>
                <w:t>Heading 1</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Stuff</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Stuff</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Stuff</w:t>
            </w:r>
        </w:p>
        <w:p/>
        <w:p>
            <w:pPr>
                <w:pStyle w:val="Heading1"/>
            </w:pPr>
            <w:r>
                <w:t>Heading 2</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Other stuff</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Other stuff</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Other stuff</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>

So, the "contents" below each heading aren't really contained inside the heading: the heading paragraph and the paragraphs that follow it are all siblings under w:body. None of the APIs that I've used are very useful for iterating over existing documents. Even if you could retrieve a list of the headings, you would still need to grab all of the paragraphs between that heading and the next one. That being said, I doubt there's a good library out there for doing this.
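If you do end up working on the XML directly, the "grab everything between this heading and the next" step can be sketched with nothing but Python's standard library. This is only a rough sketch, not a tested solution for real documents: the helper names (heading_level, paragraph_text, delete_section) are made up for illustration, it assumes the default English style IDs ("Heading1", "Heading2", ...), it matches the heading by exact text, and it only looks at direct children of w:body. The sample XML is a cut-down version of the one above with the w: namespace declared so it parses.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def heading_level(paragraph):
    """Return 1 for 'Heading1', 2 for 'Heading2', ...; None for non-headings.

    Assumption: the document uses the default English style IDs.
    """
    style = paragraph.find(f"{W}pPr/{W}pStyle")
    if style is None:
        return None
    val = style.get(f"{W}val", "")
    suffix = val[len("Heading"):]
    if val.startswith("Heading") and suffix.isdigit():
        return int(suffix)
    return None


def paragraph_text(paragraph):
    """Concatenate the text of every w:t run inside the paragraph."""
    return "".join(t.text or "" for t in paragraph.iter(f"{W}t"))


def delete_section(body, title):
    """Remove the heading named `title` plus everything up to the next
    heading of the same or a higher level. Returns the number of removed
    elements (0 if the heading was not found)."""
    doomed = []
    level = None
    for child in list(body):
        if child.tag == f"{W}sectPr":
            break  # keep the trailing page-layout settings
        lvl = heading_level(child) if child.tag == f"{W}p" else None
        if level is None:
            # Still searching for the heading to delete
            if lvl is not None and paragraph_text(child) == title:
                level = lvl
                doomed.append(child)
        elif lvl is not None and lvl <= level:
            break  # reached the next section at the same or higher level
        else:
            doomed.append(child)  # body text, tables, or deeper sub-headings
    for element in doomed:
        body.remove(element)
    return len(doomed)


SAMPLE = f"""<w:document xmlns:w="{W_NS}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Heading 1</w:t></w:r></w:p>
<w:p><w:r><w:t>Stuff</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Heading 2</w:t></w:r></w:p>
<w:p><w:r><w:t>Other stuff</w:t></w:r></w:p>
</w:body></w:document>"""

root = ET.fromstring(SAMPLE)
body = root.find(f"{W}body")
removed = delete_section(body, "Heading 1")
remaining = [paragraph_text(p) for p in body.iter(f"{W}p")]
print(removed, remaining)
```

Because sub-headings have a larger level number than the heading being deleted, they get removed along with it, which is roughly what Word does when you delete a collapsed heading in the navigation pane.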

I've used Python's docx module to create documents before and it took some ramp-up time. In general, you might want to consider an additive method (creating the headings you need) rather than a subtractive method (removing the headings you don't need). Also, FYI, it's possible to explore .docx files by renaming them to .zip.
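The renaming trick works because a .docx file is an ordinary zip archive, with the main body stored in word/document.xml. That means you can also peek inside from Python without renaming anything. A minimal sketch, assuming default English heading style IDs (list_headings is a made-up helper name, and the in-memory archive below is just a stand-in so the snippet runs; it is not a complete Word document):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def list_headings(docx_file):
    """Return (style_id, text) pairs for every paragraph whose style ID
    starts with 'Heading'. `docx_file` is a path or a file-like object."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    headings = []
    for paragraph in root.iter(f"{W}p"):
        style = paragraph.find(f"{W}pPr/{W}pStyle")
        val = style.get(f"{W}val", "") if style is not None else ""
        if val.startswith("Heading"):
            text = "".join(t.text or "" for t in paragraph.iter(f"{W}t"))
            headings.append((val, text))
    return headings


# Stand-in archive holding only word/document.xml (hypothetical content)
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr(
        "word/document.xml",
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        "<w:r><w:t>Dogs</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Stuff</w:t></w:r></w:p>"
        '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr>'
        "<w:r><w:t>Mutts</w:t></w:r></w:p>"
        "</w:body></w:document>",
    )
demo.seek(0)

headings = list_headings(demo)
print(headings)
```

With a real file you would pass its path instead, e.g. list_headings("template.docx"), where the file name is only a placeholder.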

Upvotes: 1
