Microsoft Word automation - deleting headings (and their sub-info) - programmatically

Question

Just wondering if anybody has experience with reading in Microsoft Word documents and deleting certain paragraphs and blocks programmatically (based on headings)?

Does anybody know of any libraries that could do this in one of the languages I'm comfortable in:

Python
PHP
C#
Java

I've googled a few, and most seem to be able to read and write documents (and their parts), but iterating over a list of the current headings doesn't seem to be covered. If I could get that list as an object (or something like that), then I could remove specifically what I want.

The main purpose is that I have a large template doc with lots of information, but only certain parts are required for each document (pick and choose), so I intend to build a small frontend to generate these docs on the fly.

As you'll see in the image above, deleting the "Mutts" Heading 2 item should delete everything within the red box. If this were possible using any pre-written libraries, that would be amazing, and I wouldn't have to dig into the XML.

I'd also prefer not to have to use COM (Component Object Model) if at all possible, but if it comes to that I'll probably use the Python for Windows Extensions.

Any help that you guys could provide is very much appreciated.

Jared Goguen · Accepted Answer

I'm posting this as an answer because there's too much information for a comment. With that in mind, this won't really answer your question. For a Word document that looks like:

Heading 1

Stuff

Heading 2

Other stuff

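The resulting XML, stripped of attributes and unnecessary elements, looks something like:

<w:document>
    <w:body>
        <w:p>
            <w:pPr>
                <w:pStyle/>
            </w:pPr>
            <w:r>
                <w:t>Heading 1</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Stuff</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Stuff</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Stuff</w:t>
            </w:r>
        </w:p>

        <w:p>
            <w:pPr>
                <w:pStyle/>
            </w:pPr>
            <w:r>
                <w:t>Heading 2</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Other stuff</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Other stuff</w:t>
            </w:r>
        </w:p>
        <w:p>
            <w:r>
                <w:t>Other stuff</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>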

So, the "contents" below each heading aren't really contained inside the heading; they are sibling paragraphs that simply follow it. None of the APIs that I've used are very useful for iterating over existing documents. Even if you could retrieve a list of the headings, you would still need to grab all of the paragraphs between that heading and the next one. That being said, I'm hesitant to think there's a good library out there for doing this.
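To make the "grab everything up to the next heading" step concrete, here is a minimal sketch that works on the raw WordprocessingML with only Python's standard library. It assumes that headings carry the style ids Heading1, Heading2, ... in the w:val attribute of w:pStyle (typical for the built-in English styles, but not guaranteed for every document), and that matching on the exact heading text is good enough. The function names and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def heading_level(elem):
    """Return 1 for style id 'Heading1', 2 for 'Heading2', ...; None otherwise.

    Assumption: headings use the built-in English style ids.
    """
    if elem.tag != f"{{{W}}}p":
        return None
    style = elem.find("w:pPr/w:pStyle", NS)
    if style is None:
        return None
    val = style.get(f"{{{W}}}val", "")
    suffix = val[len("Heading"):]
    if val.startswith("Heading") and suffix.isdecimal():
        return int(suffix)
    return None


def paragraph_text(elem):
    """Concatenate the text of every w:t run inside the element."""
    return "".join(t.text or "" for t in elem.iter(f"{{{W}}}t"))


def remove_section(body, title):
    """Remove the first heading whose text equals `title`, plus every sibling
    after it up to the next heading of the same or a higher level.
    Returns the number of elements removed."""
    target_level = None
    removed = 0
    for child in list(body):  # iterate over a copy; body is mutated below
        level = heading_level(child)
        if target_level is None:
            if level is not None and paragraph_text(child) == title:
                target_level = level  # found the heading; remove it below
            else:
                continue
        elif level is not None and level <= target_level:
            break  # the next sibling section starts here
        elif child.tag == f"{{{W}}}sectPr":
            continue  # keep the page/section settings at the end of the body
        body.remove(child)
        removed += 1
    return removed


# Made-up sample document, reduced to the elements that matter here.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Dogs</w:t></w:r></w:p>
<w:p><w:r><w:t>About dogs</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Mutts</w:t></w:r></w:p>
<w:p><w:r><w:t>About mutts</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Purebreds</w:t></w:r></w:p>
<w:p><w:r><w:t>About purebreds</w:t></w:r></w:p>
<w:sectPr/>
</w:body></w:document>"""

root = ET.fromstring(SAMPLE)
body = root.find("w:body", NS)
removed = remove_section(body, "Mutts")
remaining = [paragraph_text(p) for p in body.findall("w:p", NS)]
print(removed, remaining)
# prints: 2 ['Dogs', 'About dogs', 'Purebreds', 'About purebreds']
```

This only edits the in-memory tree. Writing the result back into a .docx needs extra care, because ElementTree renames namespace prefixes on output unless they are registered with ET.register_namespace, so treat it as a starting point rather than a finished tool.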

I've used Python's docx module to create documents before, and it took some ramp-up time. In general, you might want to consider an additive method (creating the headings you need) rather than a subtractive method (removing the headings you don't need). Also, FYI, it's possible to explore .docx files by renaming them to .zip.
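Since a .docx is just a zip archive, the list of headings the question asks for can be pulled straight out of word/document.xml without renaming anything. The sketch below is one way to do that with the standard library; list_headings is a hypothetical helper, it again assumes the Heading1/Heading2 style ids, and the in-memory archive it builds contains only document.xml, so it exercises the function but is not a file Word would open.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def list_headings(docx_file):
    """Return (level, text) pairs for each heading paragraph, in document order.

    `docx_file` may be a path or a binary file-like object.
    Assumption: headings use the style ids Heading1, Heading2, ...
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    headings = []
    for p in root.iterfind("w:body/w:p", NS):
        style = p.find("w:pPr/w:pStyle", NS)
        if style is None:
            continue
        match = re.fullmatch(r"Heading(\d+)", style.get(f"{{{W}}}val", ""))
        if match:
            text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
            headings.append((int(match.group(1)), text))
    return headings


# Build a stand-in archive in memory (not a complete, valid .docx).
document_xml = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Heading 1</w:t></w:r></w:p>
<w:p><w:r><w:t>Stuff</w:t></w:r></w:p>
<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr><w:r><w:t>Heading 2</w:t></w:r></w:p>
<w:p><w:r><w:t>Other stuff</w:t></w:r></w:p>
</w:body></w:document>"""

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", document_xml)

headings = list_headings(buf)
print(headings)
# prints: [(1, 'Heading 1'), (2, 'Heading 2')]
```

With a real file, list_headings("template.docx") would be the equivalent call, where template.docx is a placeholder name.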
