Surendra

Reputation: 25

How to remove special characters from XML so that reading the file does not fail with "Invalid byte 1 of 1-byte UTF-8 sequence"

I am getting an error

Invalid byte 1 of 1-byte UTF-8 sequence

while reading an XML file in Java to generate an XSD.

Then I noticed that my XML does have some special characters like '"”“?& etc. So, I have managed to remove them in Java before I process the XML to generate the XSD. But the challenge is that it is dynamic data, so we may not know what sort of characters we will encounter.

How can we remove these special characters reliably, so that the content is valid UTF-8 and this problem never occurs?

Could this be solved in XSLT to remove the characters?

How can we get rid of these characters in the part below, or allow them without issue?

 <string>message</string>
                    <string>Very good dear laughing colours laken yeh heart bhot karap hota ha brain ke baat nahi sunte ha Allah bhagwan god Na yeh kuy banayai ha dear friends 😢 😢 😢❤👍</string>

<string>message</string>
                    <string>वक़्त 🕔 और  दोस्त_मिलते 👫 तो  मुफ्त_हैं, ☺
लेकिन  उनकी_कीमत 💵 का  अंदाज़ा 😌 तब  होता_है, ☝  जब ये कहीं  खो_जाते है ।...
#</string>

Note: I have the encoding set as UTF-8 for the XML document.

Upvotes: 1

Views: 1962

Answers (1)

Patrick Dark

Reputation: 2259

Your error suggests the parser hit a byte that isn't valid UTF-8 at that position, most likely a single-byte character from another encoding (such as Windows-1252) or a control character that's prohibited in XML. XML prohibits certain characters from appearing in a document; see the Char production at https://www.w3.org/TR/xml/#charsets for the list of allowed characters in XML 1.0.

You need to remove these characters before they reach the XML; otherwise your XML will be malformed, at which point it's expected that XSLT won't be able to transform your document.
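As a rough illustration of removing such characters in Java before the text reaches the XML, here is a minimal sketch that keeps only code points matching the XML 1.0 Char production. The class and method names (XmlCharFilter, isXmlChar, stripInvalidXmlChars) are made up for this example, and it assumes you already have the text as a correctly decoded Java String. It works on code points rather than chars so that emoji such as 😢, which are stored as surrogate pairs, are kept.

```java
// Hypothetical helper, not part of any library.
final class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every disallowed code point dropped.
    // Unpaired surrogates fall in the excluded D800-DFFF range and are removed.
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(XmlCharFilter::isXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

You would call stripInvalidXmlChars on each dynamic value before writing it into the document. Note that this only removes prohibited characters; characters such as & and < are allowed by the Char production but still need to be escaped by whatever writes the XML.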

If you need to transform valid XML characters, XSLT can do that with the translate function. For example, translate(Windows-1252_string, "&#x84;&#x93;&#x94;", "&#x201e;&#x201c;&#x201d;") run on all text nodes should address Windows-1252-encoded quotation marks. Of course, it'd be better to ensure that this input is fixed before it reaches XML.
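If the stray bytes really are Windows-1252 text mixed up with UTF-8, one way to fix the input before it reaches XML is to decode it strictly as UTF-8 and fall back to Windows-1252 only when that fails. The sketch below is an assumption about your data, not a diagnosis: EncodingRepair and decodeUtf8OrWindows1252 are hypothetical names, the fallback applies to the whole byte array rather than to individual bytes, and it assumes your JDK provides the windows-1252 charset (common, but not guaranteed by the Java specification).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class EncodingRepair {

    // Decodes the bytes as UTF-8 if they are well-formed UTF-8;
    // otherwise treats the whole array as Windows-1252.
    static String decodeUtf8OrWindows1252(byte[] raw) {
        try {
            // REPORT makes the decoder throw on invalid input
            // instead of silently substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumption: anything that is not valid UTF-8 is Windows-1252.
            return new String(raw, Charset.forName("windows-1252"));
        }
    }
}
```

The resulting String can then be written out as UTF-8, so the encoding declared in the XML matches the bytes the parser actually sees. If a single value mixes both encodings, the fallback would garble its valid UTF-8 parts, so apply it per field rather than per file where possible.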

Upvotes: 0
