Reputation: 2852
I have a huge XML document, 200 MB in size, containing textual information. The data was previously stored in a two-column PageMaker file. After tagging, I found that some of the text contains hyphens, because words that did not fit the column width were split in two and joined with a hyphen. The document also uses hyphens for another purpose: to separate short sentences (in notes).
I want to find the hyphens that sit inside words. The hyphens I want to find and remove follow a consistent pattern. For example:
The first use of the hyphen (which I want to find and replace):
question is ques-tion
answer would be ans-wer
The other use of the hyphen (which should not be matched):
Pattern matchin - Regex Expressions - ...
So the two patterns are:
space-space (a separator, to be kept)
letter-letter (a broken word, to be fixed)
How can I use XQuery to find all occurrences of the second pattern (letter-letter)? Or is there any other way to find them? Finding and replacing these by hand in such a huge XML file would be a nightmare.
Upvotes: 3
Views: 4577
Reputation: 38682
200 MB is not huge. :)
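If you're totally sure that no hyphens occur in tag or attribute names, you can use sed (discouraged!). Note that the output has to be redirected; a second file name would be read as another input file:
sed -E 's/([[:alpha:]]+)-([[:alpha:]]+)/\1\2/g' doc.xml > out.xml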
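To check the pattern before running it over the whole file, here is a minimal sketch that applies the same expression to the sample strings from the question. The function names dehyphenate and list_candidates are made up for illustration, and list_candidates assumes a grep that supports -o (GNU and BSD grep do). Keep in mind that the pattern cannot tell a broken word from a legitimately hyphenated one such as "well-known", so reviewing the candidate list first is probably worthwhile.

```shell
#!/bin/sh
# Hypothetical helper: join words that were split by a hyphen (letter-letter).
# A hyphen surrounded by spaces (" - ") does not match and is left alone.
dehyphenate() {
  sed -E 's/([[:alpha:]]+)-([[:alpha:]]+)/\1\2/g'
}

# Hypothetical helper: list every letter-letter candidate with its count,
# most frequent first, so the matches can be reviewed before replacing.
# Assumes grep supports -o (print only the matching part).
list_candidates() {
  grep -oE '[[:alpha:]]+-[[:alpha:]]+' | sort | uniq -c | sort -rn
}

# Broken words are joined:
printf '%s\n' 'The ques-tion and the ans-wer' | dehyphenate
# prints: The question and the answer

# Separator hyphens are untouched:
printf '%s\n' 'Pattern matching - Regex Expressions' | dehyphenate
# prints: Pattern matching - Regex Expressions
```

To review the real document first, something like list_candidates < doc.xml | less would show what the replacement is going to touch.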
Better use XQuery for this, so you won't have to deal with complex XML syntax parsing:
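(: Recursively copy an element, removing hyphens between word characters in its text nodes. :)
declare function local:copy-replace($element as element()) as element() {
  element { node-name($element) } {
    $element/@*,
    for $child in $element/node()
    return
      typeswitch ($child)
        case element() return local:copy-replace($child)
        (: wrap the result in text { } so it stays a text node :)
        case text() return text { replace($child, "(\w+)\-(\w+)", "$1$2") }
        (: comments and processing instructions are copied unchanged :)
        default return $child
  }
};

local:copy-replace(/*)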
This does not handle attributes yet: they are copied unchanged. If hyphenated text occurs in attribute values, you will have to process and re-insert those separately.
Credit goes to an unknown user whose answer I remembered and used as a pattern here.
Upvotes: 2