Reputation: 89
I'm confused about how to get XQuery to handle whitespace the way I want. Say I have the following XML:
<body>
to<lb/>
<choice norm="Miss">Mi<glyph ref="#sm-long-s>s</glyph>s</choice>
<name type="person"><forename>Margaret</forename> <surname>Hamilton</surname></name><lb />
<name type="place">S<hi rend="superscript">t</hi> James's</name>
</body>
If I use this code
for $body in /body
return replace(string-join(
    for $t in $body//node()
    return
        typeswitch($t)
        case text() return
            if (
                sum(
                    for $a in $t/ancestor::*
                    return
                        typeswitch($a)
                        case element(choice) return 1
                        default return 0
                )=0
            ) then $t
            else null
        case element(lb) return ' '
        case element(choice) return $t/@norm
        default return null
),"\s+"," ")
I get the following output:
to MissMargaretHamilton St James's
rather than the expected
to Miss Margaret Hamilton St James's
Is there a way to fix that?
PS: There is no such thing as <forename> in the actual code, but I introduced it in this example to showcase both the line break and the space between > and < being ignored.
Upvotes: 3
Views: 783
Reputation: 163262
For good measure, here is how I would rewrite your query:
normalize-space(string-join(
    for $t in /body//node()
    return
        typeswitch($t)
        case text() return $t[not(ancestor::choice)]
        case element(lb) return ' '
        case element(choice) return $t/@norm
        default return ()
))
Upvotes: 1
Reputation: 163262
There are some very strange things about this query. For example, it seems to me that this subexpression:
sum(
    for $a in $t/ancestor::*
    return
        typeswitch($a)
        case element(choice) return 1
        default return 0
)=0
is just a convoluted way of writing empty($t/ancestor::choice).
And what is "null"? It looks to me like an element name that won't match anything in your input, hence a convoluted way of writing ().
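To make the first equivalence concrete, here is a minimal sketch that compares the two forms on every text node of a small inline fragment. The $doc fragment is a cut-down, hypothetical version of the sample from the question, not the original input.

```xquery
(: $doc is a hypothetical, simplified fragment used only for illustration. :)
let $doc :=
  <body><choice norm="Miss">Mi<glyph>s</glyph>s</choice><name>Hamilton</name></body>
return
  (: For each text node, the long sum/typeswitch test and the short
     empty() test should give the same boolean answer. :)
  every $t in $doc//text() satisfies
    (sum(
       for $a in $t/ancestor::*
       return
         typeswitch($a)
         case element(choice) return 1
         default return 0
     ) = 0)
    = empty($t/ancestor::choice)
```

This should evaluate to true: the text nodes inside choice fail both tests, and the one inside name passes both.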
What's more, your XML isn't well-formed: there's a missing quote on the ref attribute. That makes me suspect that the problem as submitted is not the problem as originally executed, so you might have inadvertently removed the clue to the solution.
However, if I fix the missing quote and run the query in Saxon, it produces the expected output. So I think the problem is that there is a bug (or to be more polite, a non-conformance) in your XQuery processor.
LATER: On reflection, I suspect you are using an XML parser that strips whitespace text nodes. This is a notorious quirk of the Microsoft MSXML parser, and makes it pretty useless for handling mixed content where such whitespace is significant. I believe it can be configured to behave "properly", but I've completely forgotten how.
The XQuery specs do leave processors some latitude in this area: they allow the XDM input tree to be constructed in any way that the processor fancies, which might include stripping all whitespace, or stripping every occurrence of the letter "x". At this point it's a question of whether you find the design choices made by your particular XQuery processor acceptable.
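One way to check whether whitespace stripping is what is happening, sketched here on the assumption that the query runs against the same /body document as in the question, is to count the whitespace-only text nodes that actually reach the query:

```xquery
(: Count text nodes that consist of whitespace only.
   The sample document contains several (line breaks after <lb/>,
   the space between </forename> and <surname>, and so on). :)
count(/body//text()[normalize-space(.) = ''])
```

If this returns 0 for the sample input, the whitespace-only text nodes were dropped before the query ever saw the tree, and no amount of rewriting the query will bring them back.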
Upvotes: 2
Reputation: 3056
XML whitespace-handling can get quite tricky. I often have to experiment to get things just right.
I like to write transformation functions, and primarily handle different elements in my typeswitch:
declare function local:transform($x)
{
    typeswitch($x)
    case element(choice) return $x/@norm/fn:string()
    case element(name) return
        if ($x/forename)
        then fn:string-join($x/node()/fn:string(), " ")
        else $x/fn:string()
    case element() return
        for $y in $x/node()
        return local:transform($y)
    default return fn:string($x)
};
let $x := (: your sample xml :)
return fn:replace(fn:string-join(local:transform($x), " "), "\s+", " ")
This sample should return your desired output. And it's easy to add cases for other elements, comment out existing cases, etc.
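One caveat if the sample XML is pasted into the query as a direct element constructor rather than loaded from a document: whitespace-only text between tags in a constructor is "boundary whitespace", and under the strip policy (the usual default) it is discarded. A minimal sketch, assuming your processor honours the standard boundary-space declaration:

```xquery
xquery version "1.0";
(: Keep whitespace-only text between tags in direct element constructors. :)
declare boundary-space preserve;

let $name :=
  <name type="person"><forename>Margaret</forename> <surname>Hamilton</surname></name>
return string($name)
```

With the declaration this yields "Margaret Hamilton"; under the strip policy the space between the two elements is removed and the result is "MargaretHamilton". The declaration only affects constructors written in the query, not documents loaded from outside.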
Upvotes: 0