user736319

Reputation: 21

How to search/replace text in WordprocessingML

In WordprocessingML (the format MS Word saves documents in), is there any way to search through the text easily?

The main problem I run into is that the WordprocessingML format breaks each paragraph down into "runs", for example:

To store the sentence "Module 1: Some Section Title", WordprocessingML specifies the XML markup to be:

  <w:p w:rsidR="00F9529C" w:rsidRDefault="00F9529C" w:rsidP="00F9529C">
   <w:pPr>
    <w:pStyle w:val="Heading1_5019"/>
   </w:pPr>
   <w:bookmarkStart w:id="0" w:name="_Toc247333659"/>
   <w:r>
    <w:t>M</w:t>
   </w:r>
   <w:r w:rsidRPr="007D2739">
    <w:t xml:space="preserve">odule 1: </w:t>
   </w:r>
   <w:r>
    <w:t>Some Section Title</w:t>
   </w:r>
   <w:bookmarkEnd w:id="0"/>
  </w:p>

As you can see, the sentence was broken into "M", "odule 1: ", and "Some Section Title". This arrangement makes it impossible to search for the sentence as a whole. Is there any way to get around this?

To clarify, I am trying to do this in PHP using DomDocument.

Upvotes: 2

Views: 1182

Answers (2)

Eric White

Reputation: 1891

I've written some example code that shows how to search and replace text in an Open XML WordprocessingML document. My approach is: once you have found a paragraph that contains text that needs to be replaced, you break up all runs in the paragraph into runs of a single character each. It is then straightforward to find the set of consecutive runs that match your search string. You can then create a new run with the replacement text and delete the single-character runs that matched the search string.

I've implemented this using XML DOM (System.Xml.XmlDocument). You can find example code in a blog post, Search and Replace Text in an Open XML WordprocessingML document. In addition, I've recorded a short screen-cast that shows how the algorithm works: http://www.youtube.com/watch?v=w128hJUu3GM
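To give a rough idea of the steps described above, here is a minimal sketch in Python using xml.etree.ElementTree. It is not the code from the blog post (that one uses System.Xml.XmlDocument). The helper names (q, run_text, split_runs, replace_first) are made up for illustration, and it assumes the Open XML namespace, runs that are direct children of the paragraph, and at most one w:t per run. The replacement run simply copies the formatting of the first matched character.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: Open XML (Word 2007+) main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def q(tag):
    """Qualified tag name in the w: namespace (hypothetical helper)."""
    return f"{{{W}}}{tag}"

def run_text(el):
    """Text of a w:r element's w:t child, or '' for anything else."""
    if el.tag != q("r"):
        return ""
    t = el.find(q("t"))
    return (t.text or "") if t is not None else ""

def split_runs(p):
    """Replace every multi-character run in paragraph p with one run per character."""
    for run in list(p):
        text = run_text(run)
        if len(text) < 2:
            continue
        idx = list(p).index(run)
        p.remove(run)
        for offset, ch in enumerate(text):
            clone = copy.deepcopy(run)      # keeps the run's w:rPr formatting
            t = clone.find(q("t"))
            t.text = ch
            t.set(XML_SPACE, "preserve")    # so a lone space is not dropped
            p.insert(idx + offset, clone)

def replace_first(p, search, replacement):
    """Replace the first occurrence of search in paragraph p; return True if replaced."""
    if not search or search not in "".join(run_text(r) for r in p):
        return False
    split_runs(p)
    # After splitting, character i of the text lives in char_runs[i].
    char_runs = [r for r in p if run_text(r)]
    text = "".join(run_text(r) for r in char_runs)
    start = text.find(search)
    matched = char_runs[start:start + len(search)]
    new_run = copy.deepcopy(matched[0])     # new run inherits formatting of first char
    t = new_run.find(q("t"))
    t.text = replacement
    t.set(XML_SPACE, "preserve")
    p.insert(list(p).index(matched[0]), new_run)
    for run in matched:
        p.remove(run)
    return True
```

Note that the paragraph is left with single-character runs around the replacement. That is still valid markup, just verbose; merging them back together would be a separate step.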

Upvotes: 1

DarinH

Reputation: 4889

Yep, that's the pain of working directly with WordML versus, say, using the Word object model.

Unfortunately, I've found nothing that eases that (the Open XML SDK, Aspose, etc. all appear to essentially just wrap the WordML XML in a thin veneer).

You CAN do some limited preprocessing on the ML and strip out lots of stuff (like all those rsidRPr attributes, etc.), but it's still going to be tricky to strip out enough of the formatting elements to be able to search the text consistently.
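As a rough illustration of that kind of preprocessing, here is a minimal Python sketch using xml.etree.ElementTree. The function names (strip_rsids, run_signature, merge_adjacent_runs) are hypothetical, and it assumes the Open XML namespace. It removes the rsid* attributes and then merges neighbouring runs only when they are plain text runs with identical w:rPr formatting; anything else (tabs, breaks, fields, bookmarks between runs) is left alone, which is exactly why this only gets you part of the way.

```python
import xml.etree.ElementTree as ET

# Assumption: Open XML (Word 2007+) main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def q(tag):
    """Qualified tag name in the w: namespace (hypothetical helper)."""
    return f"{{{W}}}{tag}"

def strip_rsids(root):
    """Delete every w:rsid* attribute (revision save IDs) below root."""
    prefix = q("rsid")
    for el in root.iter():
        for name in [n for n in el.attrib if n.startswith(prefix)]:
            del el.attrib[name]

def run_signature(run):
    """Serialized w:rPr of a run, used to compare formatting; b'' if absent."""
    rpr = run.find(q("rPr"))
    return b"" if rpr is None else ET.tostring(rpr).strip()

def is_plain_text_run(el):
    """True for an attribute-free w:r holding one w:t and optionally a w:rPr."""
    return (el.tag == q("r") and not el.attrib
            and len(el.findall(q("t"))) == 1
            and all(c.tag in (q("rPr"), q("t")) for c in el))

def merge_adjacent_runs(p):
    """Merge directly adjacent plain text runs of p that share the same formatting."""
    prev = None
    for child in list(p):
        if not is_plain_text_run(child):
            prev = None                      # anything else breaks the chain
        elif prev is not None and run_signature(prev) == run_signature(child):
            t = prev.find(q("t"))
            t.text = (t.text or "") + (child.find(q("t")).text or "")
            t.set(XML_SPACE, "preserve")
            p.remove(child)
        else:
            prev = child
```

Run on the paragraph from the question (strip_rsids first, then merge_adjacent_runs), the three runs should collapse into a single run, because none of them carries any formatting once the rsidRPr attribute is gone.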

Alternatively, you could use XPath to extract JUST the w:t elements, then string them all together and search the result, but then you've got the problem of knowing where in the document the text you found actually lives.

If you don't care about that (for instance, if you're just data mining), then that might be the fastest solution.
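For what it's worth, the "where does it live" problem can be eased by recording character offsets while concatenating. Below is a minimal Python sketch with xml.etree.ElementTree (the same idea should carry over to PHP's DOMDocument and DOMXPath). The names paragraph_text_with_map and locate are made up for illustration, and the namespace is assumed to be the Open XML one.

```python
import xml.etree.ElementTree as ET

# Assumption: Open XML (Word 2007+) main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>M</w:t></w:r>
  <w:r w:rsidRPr="007D2739"><w:t xml:space="preserve">odule 1: </w:t></w:r>
  <w:r><w:t>Some Section Title</w:t></w:r>
</w:p>"""

def paragraph_text_with_map(p):
    """Return (text, spans); spans lists (start, end, w:t element) per text node."""
    parts, spans, pos = [], [], 0
    for t in p.iterfind(".//w:t", NS):
        s = t.text or ""
        spans.append((pos, pos + len(s), t))
        parts.append(s)
        pos += len(s)
    return "".join(parts), spans

def locate(p, needle):
    """Return the w:t elements that the first match of needle overlaps ([] if none)."""
    text, spans = paragraph_text_with_map(p)
    start = text.find(needle)
    if start < 0:
        return []
    end = start + len(needle)
    return [t for (a, b, t) in spans if a < end and b > start]

p = ET.fromstring(SAMPLE)
print(paragraph_text_with_map(p)[0])
print([t.text for t in locate(p, "Module 1")])
```

Searching per paragraph (one w:p at a time) rather than over the whole document keeps matches from running across paragraph boundaries.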

Upvotes: 0
