user3416249

Reputation: 93

Remove characters not in specified XSLT encoding

I am trying to transform a UTF-8 XML source file into an iso-8859-1 XML destination file. I would like the XSLT to remove all characters that are not valid in iso-8859-1. Is this possible?

Ideally, the stylesheet would receive the target encoding as a parameter, remove all characters that are invalid in that encoding, and use the parameter to set the encoding attribute of the xsl:output element.

I ran my tests on a file that contains Chinese characters. My XSLT contains

<xsl:output method="xml" encoding="iso-8859-1" indent="yes" />

but the Chinese characters are transformed into things like &#20320;

Thanks in advance.

Upvotes: 1

Views: 3972

Answers (3)

Michael Kay

Reputation: 163262

For iso-8859-1 you can do

replace($x, '[^&#x1;-&#xff;]', '')

But this doesn't generalize to other encodings.
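If the filtering can happen outside the stylesheet (before or after the transformation), the idea does generalize to any encoding. The following is a minimal Python sketch, not something XSLT itself offers; the function name drop_unencodable is hypothetical, and it assumes the text is already available as a decoded string.

```python
def drop_unencodable(text: str, encoding: str) -> str:
    """Return text without the characters the target encoding cannot represent."""
    # errors="ignore" skips every character that has no representation in
    # the target encoding; decoding the bytes again yields the filtered string.
    return text.encode(encoding, errors="ignore").decode(encoding)


if __name__ == "__main__":
    sample = "caf\u00e9 \u4f60\u597d"  # "café" followed by two Chinese characters
    print(repr(drop_unencodable(sample, "iso-8859-1")))
    print(repr(drop_unencodable(sample, "ascii")))
```

The encoding name is an ordinary argument here, which matches the "target encoding as a parameter" idea from the question.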

If you're using Saxon then I would suggest customizing the serializer (you can set your own SerializerFactory, which can create a pipeline containing your own XMLEmitter, which can subclass the standard XMLEmitter to omit characters that aren't in the chosen encoding instead of escaping them).

Alternatively, postprocess the output (e.g. with Perl or Awk) to remove all numeric character references.
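As a rough illustration of such a postprocessing step, here is a Python sketch (Python rather than Perl or Awk is my choice, and strip_char_refs is a hypothetical name). Unlike the blanket removal described above, it keeps references at or below a configurable code point by default, which is an assumption on my part; pass max_code_point=-1 to remove every reference. It works on the serialized text with a regular expression, so it would also touch reference-like strings inside CDATA sections or comments.

```python
import re

# Matches decimal (&#20320;) and hexadecimal (&#x4F60;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def strip_char_refs(xml_text: str, max_code_point: int = 0xFF) -> str:
    """Remove numeric character references above max_code_point."""

    def _replace(match: "re.Match[str]") -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep references the target encoding could have held directly.
        return match.group(0) if code_point <= max_code_point else ""

    return _CHAR_REF.sub(_replace, xml_text)


if __name__ == "__main__":
    print(strip_char_refs("<test>a&#20320;b&#x4F60;c&#233;</test>"))
```") == "<test>abc&#233;</test>"
assert strip_char_refs("<test>&#233;&#20320;</test>", max_code_point=-1) == "<test></test>"
assert strip_char_refs("<test>no refs &amp; here</test>") == "<test>no refs &amp; here</test>"
</test>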

However, more than that, I would question the requirement. What you want to do doesn't seem like a good thing to do.

Upvotes: 1

Tomalak

Reputation: 338108

XSL output encoding determines the encoding the output file is in.

It guarantees that no character written to the output file/stream is outside the defined range of characters for, in this case, iso-8859-1. And the string '&#20320;' is in that range, even though the character it represents (U+4F60, 你) isn't.

The <xsl:output encoding="..."> declaration switches the byte encoding (e.g. '你' is 0xE4 0xBD 0xA0 in UTF-8 and 0x60 0x4F in UTF-16LE), but if a character cannot be represented in the target encoding it does not clobber your text, i.e. it will not replace Chinese characters in the input with question marks (or even worse, nothing) in the output.

It tries to keep the character by using a well-defined escaping scheme: a numeric character reference. The user agent that displays the data is free to display it as a question mark or, if it has the capability, as the original character.
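The byte-level difference can be checked outside XSLT. This is a small Python sketch, offered only as an illustration; the xmlcharrefreplace error handler is Python's own mechanism and merely happens to produce the same kind of escape an XML serializer emits.

```python
ch = "\u4f60"  # the character 你 (U+4F60)

utf8_bytes = ch.encode("utf-8")
utf16_le_bytes = ch.encode("utf-16-le")  # little-endian, no byte order mark
utf16_be_bytes = ch.encode("utf-16-be")  # big-endian, no byte order mark

# iso-8859-1 has no byte for U+4F60, so ask for a character reference instead
# of raising an error or dropping the character.
latin1_bytes = ch.encode("iso-8859-1", errors="xmlcharrefreplace")

if __name__ == "__main__":
    print(utf8_bytes.hex())      # e4bda0
    print(utf16_le_bytes.hex())  # 604f
    print(utf16_be_bytes.hex())  # 4f60
    print(latin1_bytes)          # b'&#20320;'
```

Every byte of the escaped form is plain ASCII, which is why it is legal in an iso-8859-1 file even though the character it stands for is not.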

The following XML:

<?xml version="1.0" encoding="iso-8859-1"?>
<test>&#20320;</test>

and

<?xml version="1.0" encoding="UTF-8"?>
<test>你</test>

both display as

<test>你</test>

in my browser, so what your XSLT processor does is actually the Right Thing. Think about whether you really want to lose those characters.
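The same equivalence holds for any conforming XML parser, not just a browser. A minimal Python sketch using the standard library's ElementTree (my choice of tool, not part of the original setup) parses both documents and compares the text they contain:

```python
import xml.etree.ElementTree as ET

# The iso-8859-1 document carries the character as a numeric reference.
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?>\n<test>&#20320;</test>'

# The UTF-8 document carries the character directly as three bytes.
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?>\n<test>\u4f60</test>'.encode("utf-8")

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

if __name__ == "__main__":
    # Both parses yield the same one-character string, U+4F60.
    print(latin1_text == utf8_text)
```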

Upvotes: 1

michael.hor257k

Reputation: 116959

Assuming XSLT 1.0:
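It's possible, but rather tedious. You need to list all the characters you want to keep, then use the translate() function (twice) on every text node you output to the result tree: the inner call strips the allowed characters to leave only the unwanted ones, and the outer call removes those from the original text. For example, this stylesheet: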

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:param name="charset" select="'1234567890'" />

<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="text()">
    <xsl:value-of select="translate(., translate(., $charset, ''), '')"/>
</xsl:template>

</xsl:stylesheet>

when applied to the following input:

<input>
    <para>John has 3 apples.</para>
    <para>Eve has 2 oranges.</para>
</input>

will result in:

<?xml version="1.0" encoding="UTF-8"?>
<input>
  <para>3</para>
  <para>2</para>
</input>
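To make the double translate() trick easier to follow, here is the same logic as a Python sketch. It is only an analogy: keep_only is a hypothetical name, and Python's str.translate stands in for the XPath function.

```python
def keep_only(text: str, charset: str) -> str:
    """Keep only the characters of text that appear in charset."""
    # Inner step: delete the allowed characters, which leaves exactly the
    # characters that should NOT survive.
    unwanted = text.translate(str.maketrans("", "", charset))
    # Outer step: delete those unwanted characters from the original text.
    return text.translate(str.maketrans("", "", unwanted))


if __name__ == "__main__":
    print(keep_only("John has 3 apples.", "1234567890"))
    print(keep_only("Eve has 2 oranges.", "1234567890"))
```

For iso-8859-1 the charset string would have to spell out every permitted character, which is where the tedium comes from.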

Upvotes: 1
