Kumar

Reputation: 501

Invalid XML character: XSLT error while processing XML

While processing an XML file with XSLT, I get the following errors, but I cannot see those characters in the XML:

Character reference "&#16" is an invalid XML character.
Character reference "&#4" is an invalid XML character.
Character reference "&#4" is an invalid XML character.
Character reference "&#18" is an invalid XML character.
Character reference "&#1" is an invalid XML character.
Character reference "&#2" is an invalid XML character.
Character reference "&#25" is an invalid XML character.

Please advise.

The XML is generated from a CSV text file that uses UTF-8 character encoding.

Upvotes: 1

Views: 5498

Answers (4)

Michael Kay

Reputation: 163262

These character references are legal in XML 1.1 but not in XML 1.0. Check whether the XML parser you are using supports XML 1.1, and whether the XML declaration at the top of the file specifies <?xml version="1.1"?>.

Upvotes: 2

zx485

Reputation: 29022

These are non-printable ASCII control codes, which occupy the range 0 to 31 decimal in the ASCII table. Most text editors do not display them, which is why you don't see them. If you can switch your editor to hex mode, you'll find values like 04h = 4d, 12h = 18d, and so on, next to the normal UTF-8 (or other) encodings of printable characters, such as 41h for 'A' and 42h for 'B'.
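If no hex editor is at hand, a few lines of Python can report where such bytes sit. This is only a sketch: it assumes the control codes are present as raw bytes in the CSV data, and the function name `find_control_bytes` and the sample data are made up for illustration.

```python
def find_control_bytes(data: bytes):
    """Return (offset, value) pairs for bytes below 0x20 other than tab, LF, CR."""
    allowed = {0x09, 0x0A, 0x0D}
    return [(i, b) for i, b in enumerate(data) if b < 0x20 and b not in allowed]


# Hypothetical CSV content containing 04h and 12h.
sample = b"id,name\r\n1,Al\x04ice\r\n2,Bo\x12b\r\n"
for offset, value in find_control_bytes(sample):
    print(f"offset {offset}: {value:02X}h = {value}d")
```

For a real file you would pass the result of `open(path, "rb").read()` instead of `sample`.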

So the easiest way to get rid of them is to use a tool that filters them out. On Linux, you could use the approach described here.
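If you would rather not depend on a command-line tool, the same kind of filter can be sketched in Python. This is not the linked approach, just an equivalent idea: delete every byte below 20h except tab, line feed and carriage return. The names `CONTROL_BYTES` and `strip_control_bytes` are my own. Working on raw bytes should be safe for UTF-8 input, because bytes below 80h never occur inside a multi-byte sequence.

```python
# Bytes 0x00-0x1F except tab (0x09), line feed (0x0A) and carriage return (0x0D).
CONTROL_BYTES = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))


def strip_control_bytes(data: bytes) -> bytes:
    """Delete the control bytes and leave every other byte untouched."""
    return data.translate(None, CONTROL_BYTES)


cleaned = strip_control_bytes(b"1,Al\x04ice\r\n")
print(cleaned)
```

Run this over the CSV before it is converted to XML, then write the result to a new file.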

Upvotes: 1

xjuice

Reputation: 310

The number after &# is a character code in decimal format (&#x would introduce a code in hexadecimal format); for values below 128 it is the same as the ASCII code. The codes 16, 4, 18, etc. do not denote printable characters. They are control characters, which text editors usually do not display by default. These characters are not allowed in XML 1.0 (the exceptions are tab, line feed and carriage return), so your XML is not well-formed.

The CSV file probably contained these illegal bytes, and the XML was generated without any kind of content validation (i.e. the contents of the CSV file were simply copied byte-by-byte into the XML).

Here are some options:

  • Check if your XSLT processor can be configured to ignore these illegal bytes.
  • Clean those characters yourself with some low-level data processor that just reads through the bytes and drops all illegal ones from it.
  • If the csv-to-xml transformation is under your control, fix that to produce valid XML.
  • If it's some third party application, you should request a fix from the supplier.
  • Use some other tool for creating the XML from the CSV file.
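For the case where the csv-to-xml step is under your control, here is a rough sketch of what a fixed conversion could look like. The element names (`rows`, `row`, `field`) and the function name `csv_to_xml` are invented for the example, and it assumes a CSV with a header row and no rows longer than the header. As far as I know ElementTree escapes markup such as & and < but does not reject control characters on output, so they are dropped explicitly here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Map every code point below 0x20, except tab, LF and CR, to None (delete).
_DROP = {cp: None for cp in range(0x20) if cp not in (0x09, 0x0A, 0x0D)}


def csv_to_xml(csv_text: str) -> str:
    """Build <rows><row><field name="...">...</field></row></rows> from CSV text."""
    root = ET.Element("rows")
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = ET.SubElement(root, "row")
        for name, value in record.items():
            field = ET.SubElement(row, "field", name=name.translate(_DROP))
            # Short rows yield None for missing fields; treat them as empty.
            field.text = (value or "").translate(_DROP)
    return ET.tostring(root, encoding="unicode")


# Hypothetical input with a stray 04h inside a value.
xml_text = csv_to_xml("id,name\n1,Al\x04ice\n")
print(xml_text)
```

This only removes the control characters below 20h. A stricter filter would follow the full Char production of the XML specification.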

Upvotes: 3

Mads Hansen

Reputation: 66714

Those are control characters. Apart from tab, line feed and carriage return, the control characters below #x20 are not allowed in XML 1.0, and neither are code points outside the ranges listed below. This also means that using, for example, the character reference &#x4; is forbidden.

See the XML 1.0 Recommendation, §2.2 Characters.

The global list of allowed characters is:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
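The production translates fairly directly into a regular expression. The following is a sketch of that idea in Python, where the names `is_valid_xml10` and `strip_illegal_xml10` are mine and not part of any library. The character class is the negation of the ranges above, so it matches anything that is not a legal Char.

```python
import re

# Negation of production [2]: any code point outside the allowed ranges.
_ILLEGAL_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_valid_xml10(text: str) -> bool:
    """True if every character of text matches the XML 1.0 Char production."""
    return _ILLEGAL_XML10.search(text) is None


def strip_illegal_xml10(text: str) -> str:
    """Remove every character that the Char production does not allow."""
    return _ILLEGAL_XML10.sub("", text)


print(is_valid_xml10("A\x04B"))
print(strip_illegal_xml10("A\x04B\x10C"))
```

Note that this checks decoded characters. A character reference such as &#x4; written out in the XML text consists of legal characters itself and is only rejected when the parser resolves it.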

Upvotes: 1
