joelproko

Reputation: 89

Inconsistent Whitespace handling in XQuery?

I'm confused about how to get XQuery to handle whitespace the way I want it to. Say I have the following XML:

<body>
to<lb/>
<choice norm="Miss">Mi<glyph ref="#sm-long-s>s</glyph>s</choice>
<name type="person"><forename>Margaret</forename> <surname>Hamilton</surname></name><lb />
<name type="place">S<hi rend="superscript">t</hi> James's</name>
</body>

If I use this code

for $body in /body
return replace(string-join(
    for $t in $body//node()
    return
        typeswitch($t)
        case text() return
            if (
                sum(
                    for $a in $t/ancestor::*
                    return
                        typeswitch($a)
                        case element(choice) return 1
                        default return 0
                )=0
            ) then $t
            else null
        case element(lb) return ' '
        case element(choice) return $t/@norm
        default return null
),"\s+"," ")

I get the following output:

to MissMargaretHamilton St James's

rather than the expected

to Miss Margaret Hamilton St James's

Is there a way to fix that?

PS: There is no such thing as <forename> in the actual code, but I introduced it in this example to showcase both the linebreak and the space between > and < being ignored.

Upvotes: 3

Views: 783

Answers (3)

Michael Kay

Reputation: 163262

For good measure, here is how I would rewrite your query:

normalize-space(string-join(
    for $t in /body//node()
    return
        typeswitch($t)
        case text() return $t[not(ancestor::choice)]
        case element(lb) return ' '
        case element(choice) return $t/@norm
        default return ()
))

Upvotes: 1

Michael Kay

Reputation: 163262

There are some very strange things about this query. For example, it seems to me that this subexpression:

            sum(
                for $a in $t/ancestor::*
                return
                    typeswitch($a)
                    case element(choice) return 1
                    default return 0
            )=0 

is just a convoluted way of writing empty($t/ancestor::choice).
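
To see the equivalence concretely, here is a minimal sketch that evaluates both forms for every text node of a small inline fragment. The fragment and the variable names $doc, $verbose and $short are made up for illustration. It should return true once per text node:

```xquery
(: Hypothetical inline sample; only the nesting under <choice> matters here. :)
let $doc :=
    <body><choice norm="Miss">Mi<glyph>s</glyph>s</choice><name>Hamilton</name></body>
for $t in $doc//text()
(: The original formulation: count the choice ancestors and compare to 0. :)
let $verbose :=
    sum(
        for $a in $t/ancestor::*
        return
            typeswitch($a)
            case element(choice) return 1
            default return 0
    ) = 0
(: The short formulation: is there no choice ancestor at all? :)
let $short := empty($t/ancestor::choice)
return $verbose = $short
```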

And what is "null"? It looks to me like an element name that won't match anything in your input, hence a convoluted way of writing ().
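
A small sketch of why that is, assuming an inline element ($ctx, a made-up name) stands in for your input: a bare name such as null is read as the path step child::null, which selects nothing when no such child exists.

```xquery
(: $ctx is a hypothetical stand-in for the input; it has no child named "null". :)
let $ctx := <body>to<lb/></body>
return (
    count($ctx/null),   (: child::null selects no nodes, so this is 0 :)
    count(())           (: the explicit empty sequence, also 0 :)
)
```

Note that in the original query the bare null is evaluated against the context item, so it only works at all because a context item happens to be set.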

What's more, your XML isn't well-formed: there's a missing quote on the ref attribute. That makes me suspect that the problem as submitted is not the problem as originally executed, so you might have inadvertently removed the clue to the solution.

However, if I fix the missing quote and run the query in Saxon, it produces the expected output. So I think the problem is that there is a bug (or to be more polite, a non-conformance) in your XQuery processor.

LATER: On reflection, I suspect you are using an XML parser that strips whitespace text nodes. This is a notorious quirk of the Microsoft MSXML parser, and makes it pretty useless for handling mixed content where such whitespace is significant. I believe it can be configured to behave "properly", but I've completely forgotten how.
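
One way to check this suspicion, assuming the document is the context item as in the original query, is to count the whitespace-only text nodes the processor actually sees:

```xquery
(: Count the text nodes under body that contain nothing but whitespace.
   Assumes the source document is the context item, as in the question. :)
count(/body//text()[normalize-space(.) = ''])
```

A result of 0 would suggest that the whitespace-only text nodes (such as the space between </forename> and <surname>) were dropped before the query ever ran. A parse of the sample that keeps them gives a non-zero count.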

The XQuery specs do leave processors some latitude in this area: they allow the XDM input tree to be constructed in any way that the processor fancies, which might include stripping all whitespace, or stripping every occurrence of the letter "x". At this point it's a question of whether you find the design choices made by your particular XQuery processor acceptable.

Upvotes: 2

joemfb

Reputation: 3056

XML whitespace-handling can get quite tricky. I often have to experiment to get things just right.

I like to write transformation functions, and primarily handle different elements in my typeswitch:

declare function local:transform($x)
{
  typeswitch($x)
  case element(choice) return $x/@norm/fn:string()
  case element(name) return
    if ($x/forename)
    then fn:string-join($x/node()/fn:string(), " ")
    else $x/fn:string()
  case element() return
    for $y in $x/node()
    return local:transform($y)
  default return fn:string($x)
};

let $x := (: your sample xml :)
return fn:replace(fn:string-join(local:transform($x), " "), "\s+", " ")

This sample should return your desired output. And it's easy to add cases for other elements, comment out existing cases, etc.
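
One caveat if you paste the sample XML straight into the query as a direct element constructor: in that case the query's boundary-space policy, not the XML parser, decides whether the whitespace-only text between </forename> and <surname> survives. The XQuery spec's default is strip, though implementations may override it. A minimal sketch, using only the name element from the question:

```xquery
(: Keep whitespace-only text between tags in direct element constructors. :)
declare boundary-space preserve;

let $name :=
    <name type="person"><forename>Margaret</forename> <surname>Hamilton</surname></name>
return string($name)
```

With preserve this yields "Margaret Hamilton". With the strip policy the space between the two child elements is discarded and the result is "MargaretHamilton", which looks just like the symptom in the question.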

Upvotes: 0
