John

Reputation: 2852

pattern matching using XQuery

I have a huge XML document, 200 MB in size, containing textual information. The data was originally stored in a PageMaker file with two columns. After tagging, I found that some of the text contains hyphens. This is because words that did not fit the column width were split into two parts separated by a hyphen. The XML document also uses hyphens for another purpose: to separate short sentences (in notes).

I want to find the hyphens that sit inside words. I have noticed that the hyphens I want to find and remove follow a standard pattern. For example:

The first use of the hyphen (which I want to find and replace):

question is ques-tion answer would be ans-wer

The other use of the hyphen (not to be found):

Pattern matching - Regex Expressions - ...

So the standard formats for the two uses are:

space-space

letter-letter

How can I use XQuery to find all of these, i.e. the second pattern (letter-letter)? Or is there any other way to find them? Finding and replacing these in a huge XML file... my god.

Upvotes: 3

Views: 4577

Answers (1)

Jens Erat

Reputation: 38682

200 MB is not huge. :)

If you're absolutely sure that no hyphens occur in tag or attribute names, you could use sed (discouraged!):

sed -E 's/([[:alpha:]]+)-([[:alpha:]]+)/\1\2/g' doc.xml > out.xml
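
To see what that substitution does before running it on the whole file, here is a small sketch applied to the two sample lines from the question. The helper name dehyphenate is made up for illustration; it uses the same letter-hyphen-letter pattern (written without the backslash, which is not needed) and reads from standard input.

```shell
# Hypothetical helper (name is an assumption, not part of sed):
# joins words that were split by a layout hyphen.
# Only letter-hyphen-letter matches; in " - " the hyphen has no
# letter directly on either side, so those separators are kept.
dehyphenate() {
  sed -E 's/([[:alpha:]]+)-([[:alpha:]]+)/\1\2/g'
}

# The two sample lines from the question, one per line.
printf '%s\n' \
  'question is ques-tion answer would be ans-wer' \
  'Pattern matching - Regex Expressions - ...' | dehyphenate
# prints:
#   question is question answer would be answer
#   Pattern matching - Regex Expressions - ...
```

Note that a genuine compound such as well-known matches the same pattern and would be joined too, so it may be worth reviewing the matches before trusting the result.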

It is better to use XQuery for this, so you don't have to deal with parsing XML syntax yourself:

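(: Recursively copy an element, joining hyphen-split words in its text nodes. :)
(: \p{L} matches letters only, so "ques-tion" is joined; " - " and "1-2" are not. :)
(: Comments and processing instructions are copied unchanged. :)
declare function local:copy-replace($element as element()) as element() {
  element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      if ($child instance of element())
      then local:copy-replace($child)
      else if ($child instance of text())
      then text { replace($child, "(\p{L}+)-(\p{L}+)", "$1$2") }
      else $child
  }
};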

local:copy-replace(/*)

This doesn't handle attribute values yet; attributes are copied unchanged. If hyphenated text occurs in attributes, you will have to process them separately.

Some credit goes to an unknown user in this answer, whose pattern I gladly remembered.

Upvotes: 2
