Reputation: 706
Just wondering if anybody has experience with reading in Microsoft Word documents and deleting certain paragraphs and blocks programmatically (based on headings).
Does anybody know of any libraries that could do this in one of the languages I'm comfortable in :
I've googled a few and most seem to be able to read and write documents (and their parts), but iterating over a list of the current headings doesn't seem to be covered. If I could get the headings as a list object (or something like that), then I could remove specifically what I want.
The main purpose is that I have a large template doc with lots of information, but only certain parts are required for any given document. I want to pick and choose the parts each time, so I intend to build a small frontend to generate these docs on the fly.
As you'll see in the image above, deleting the "Mutts" heading 2 item would delete everything within the red box. If this were possible using any pre-written libraries, that would be amazing and I wouldn't have to dig into the XML.
I'd also prefer not to have to use the COM (Component Object Model) if at all possible, but if it comes to that I'll probably use the Python for Windows Extensions.
Any help that you guys could provide is very much appreciated.
Upvotes: 3
Views: 151
Reputation: 9008
I'm posting this as an answer because there's too much information for a comment. With that in mind, this won't really answer your question. For a Word document that looks like:
Heading 1
Stuff
Stuff
Stuff
Heading 2
Other stuff
Other stuff
Other stuff
The resulting xml, stripped of attributes and unnecessary elements, looks something like:
<?xml version="1.0" encoding="UTF-8"?>
<w:document>
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading1"/>
      </w:pPr>
      <w:r>
        <w:t>Heading 1</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:t>Stuff</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:t>Stuff</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:t>Stuff</w:t>
      </w:r>
    </w:p>
    <w:p/>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading1"/>
      </w:pPr>
      <w:r>
        <w:t>Heading 2</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:t>Other stuff</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:t>Other stuff</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:t>Other stuff</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:document>
So, the "contents" below each heading aren't really contained inside the heading: headings and body paragraphs are all sibling w:p elements under w:body. None of the APIs that I've used are very useful for iterating over existing documents. Even if you could retrieve a list of the headings, you would still need to grab all of the paragraphs between that heading and the next one. That being said, I doubt there's a good library out there for doing this.
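To make the "grab everything up to the next heading" idea concrete, here is a minimal sketch using only Python's standard library on the body XML. The helper names (`heading_level`, `paragraph_text`, `list_headings`, `delete_section`) are made up for illustration, and it assumes headings use the default English style IDs (`Heading1`, `Heading2`, ...); localized or custom templates may use different IDs. The sample declares the `w` namespace, which the stripped-down XML above omits.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Cut-down stand-in for word/document.xml (assumption: default style IDs).
SAMPLE = f"""<w:document xmlns:w="{W_NS}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Heading 1</w:t></w:r></w:p>
<w:p><w:r><w:t>Stuff</w:t></w:r></w:p>
<w:p/>
<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Heading 2</w:t></w:r></w:p>
<w:p><w:r><w:t>Other stuff</w:t></w:r></w:p>
</w:body></w:document>"""


def heading_level(paragraph):
    """Return 1 for style 'Heading1', 2 for 'Heading2', ...; None otherwise."""
    style = paragraph.find(f"{W}pPr/{W}pStyle")
    if style is None:
        return None
    suffix = style.get(f"{W}val", "")[len("Heading"):]
    if style.get(f"{W}val", "").startswith("Heading") and suffix.isdigit():
        return int(suffix)
    return None


def paragraph_text(paragraph):
    """Concatenate the text of every run in a paragraph."""
    return "".join(t.text or "" for t in paragraph.iter(f"{W}t"))


def list_headings(body):
    """Return (level, text) for each heading paragraph directly under w:body."""
    return [(heading_level(p), paragraph_text(p))
            for p in body.findall(f"{W}p") if heading_level(p) is not None]


def delete_section(body, title):
    """Remove the heading called `title` plus everything after it, stopping
    at the next heading of the same or a higher level."""
    removing_level = None
    for child in list(body):  # iterate over a copy while removing
        level = heading_level(child) if child.tag == f"{W}p" else None
        if removing_level is None:
            if level is not None and paragraph_text(child) == title:
                removing_level = level
                body.remove(child)
        elif level is not None and level <= removing_level:
            break  # the next section starts here
        elif child.tag == f"{W}sectPr":
            break  # keep the trailing section properties
        else:
            body.remove(child)


body = ET.fromstring(SAMPLE).find(f"{W}body")
delete_section(body, "Heading 1")
print(list_headings(body))  # [(1, 'Heading 2')]
```

Because the loop stops only at a heading of the same or a higher level, deleting a level-2 heading would also take any level-3 subsections beneath it, which seems to be the behaviour the question describes. Tables and other non-paragraph blocks inside the section are removed as well.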
I've used Python's docx module to create documents before and it took some ramp-up time. In general, you might want to consider an additive method (creating the headings you need) rather than a subtractive method (removing the headings you don't need). Also, FYI, it's possible to explore .docx files by renaming them to .zip.
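The renaming step isn't strictly necessary if you want to poke around from code, since a .docx file is already a zip archive. As a rough sketch (the function names are invented for this example, and the demo builds a tiny in-memory stand-in archive rather than a real Word file), Python's standard `zipfile` module can list the parts and pull out the main document XML:

```python
import io
import zipfile


def list_docx_parts(docx):
    """Return the names of all parts inside a .docx (path or file object)."""
    with zipfile.ZipFile(docx) as package:
        return package.namelist()


def read_docx_part(docx, part="word/document.xml"):
    """Return the raw bytes of one part; the main body normally lives
    at word/document.xml."""
    with zipfile.ZipFile(docx) as package:
        return package.read(part)


# Demo on an in-memory stand-in; with a real file, pass its path instead.
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("word/document.xml", "<w:document/>")

fake_docx.seek(0)
print(list_docx_parts(fake_docx))
fake_docx.seek(0)
print(read_docx_part(fake_docx))
```

With a real document you would call `read_docx_part("template.docx")` (a hypothetical filename) and feed the bytes to an XML parser.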
Upvotes: 1