Damounet

Reputation: 120

Prevent HTML entities from being resolved during XSL processing

I have a Java program that handles XML files in S1000D format, which is used for technical documentation. I need to update some metadata in these files, and I am using Saxon to do so.

However, Saxon changes more than what my XSL stylesheet asks for.

Here is an extract of one of my input files:

<dmodule xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.s1000d.org/S1000D_4-1/xml_schema_flat/schedul.xsd">
...
    <reqSpares>
        <noSpares></noSpares>
    </reqSpares>
    <reqSafety>
        <noSafety></noSafety>
    </reqSafety>
...
    <timeLimit>
        <remarks>
            <simplePara>Lorem ipsum</simplePara>
            <simplePara>Lorem ipsum dolor sit amet, consectetur adipiscing elit.&#xA;Vestibulum pulvinar sapien at lacus lacinia,&#xA;eu maximus arcu vestibulum.</simplePara>
        </remarks>
    </timeLimit>
...

And here is the result of my transformation:

<dmodule xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.s1000d.org/S1000D_4-1/xml_schema_flat/schedul.xsd">
...
    <reqSpares>
        <noSpares/>
    </reqSpares>
    <reqSafety>
        <noSafety/>
    </reqSafety>
...
    <timeLimit>
        <remarks>
            <simplePara>Lorem ipsum</simplePara>
            <simplePara>Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Vestibulum pulvinar sapien at lacus lacinia,
eu maximus arcu vestibulum.</simplePara>
        </remarks>
    </timeLimit>
...

Even though my XSL does not touch those lines, they are changed as shown above: the empty elements are collapsed into self-closing tags and the &#xA; references are replaced by literal line breaks.

My constraint is that I am not allowed, for any reason, to alter the structure or the content of the XML I am transforming, as happens in this example. The service that provides the input does not want to edit it, either by adding an entity declaration at the start of the XML file or by wrapping the HTML entities in a CDATA section.

In Saxon, we have tried:

I have also looked into BaseX, but the problem is the same, and I do not know that library well enough to tell whether this behavior can be achieved with it.

Any help would be appreciated!

Upvotes: 1

Views: 200

Answers (1)

Michael Kay

Reputation: 163360

Distinctions like the difference between <foo/> and <foo></foo> are lost by the time the data has been parsed (similarly, the use of single vs double quotation marks around attributes, whitespace within start and end tags, etc), and XML parsers don't provide any way of disabling expansion of entity references. Since XSLT operates on the output of an XML parser, if an XSLT processor doesn't see such distinctions then it can't preserve them.

Keeping entity references intact is a perfectly reasonable requirement, and my usual workaround is to use a text editor to globally replace & with § (after first checking that § doesn't appear in the file, of course) and then reverse the process on completion.
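
The same replace-and-restore trick can be done in code rather than in a text editor. The following is only a minimal sketch under a few assumptions: the class and method names (EntityShield, protect, restore, transform) are made up for illustration, the whole document fits in a String, and the JDK's built-in identity transformer stands in for the real Saxon stylesheet run.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: hides '&' from the XML parser, then puts it back.
class EntityShield {
    // Section sign, written as an escape so the source file encoding does not matter.
    static final char PLACEHOLDER = '\u00A7';

    // Replace every '&' with the placeholder, refusing input that already contains it.
    static String protect(String xml) {
        if (xml.indexOf(PLACEHOLDER) >= 0) {
            throw new IllegalArgumentException("Placeholder already present in input");
        }
        return xml.replace('&', PLACEHOLDER);
    }

    // Reverse of protect(): turn every placeholder back into '&'.
    static String restore(String xml) {
        return xml.replace(PLACEHOLDER, '&');
    }

    // Stand-in for the real transformation: a JDK identity transform.
    // In practice this is where the Saxon stylesheet would be applied.
    static String transform(String xml) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String input = "<simplePara>elit.&#xA;Vestibulum &amp; lacus</simplePara>";
        String output = restore(transform(protect(input)));
        System.out.println(output);
    }
}
```

Because the parser only ever sees § followed by ordinary text, it has nothing to expand, and the references come back untouched when the placeholder is reversed. This does nothing for the <foo></foo> versus <foo/> distinction, which is still lost at parse time.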

Keeping the exact lexical form of start and end tags is a much more questionable requirement. If you're being asked to do this, then the requirement is coming from someone who doesn't understand XML. Saxon gives you a lot of control over how the output is serialized (for example the serialization option saxon:canonical="yes" prevents use of empty element tags in the result), but it doesn't allow you to preserve whatever was in the input. If you're being told that's the requirement, then you need to ask "why" and "how much are you prepared to pay for this" - it will add greatly to your costs because you can forget all off-the-shelf XML processing libraries.

Upvotes: 2
