Reputation: 25
I am getting an error
Invalid byte 1 of 1-byte UTF-8 sequence
while reading an XML file in Java to generate an XSD.
Then I noticed that my XML contains some special characters like '"”“?& etc., so I removed them in Java before processing the XML to generate the XSD. The challenge is that the data is dynamic, so we cannot know in advance which characters we will encounter.
How can we remove these special characters reliably, so that the content is valid UTF-8 and this problem never occurs?
Could this be solved in XSLT to remove the characters?
How can we strip these characters from the part below, or allow them without issue?
<string>message</string>
<string>Very good dear laughing colours laken yeh heart bhot karap hota ha brain ke baat nahi sunte ha Allah bhagwan god Na yeh kuy banayai ha dear friends 😢 😢 😢❤👍</string>
<string>message</string>
<string>वक़्त 🕔 और दोस्त_मिलते 👫 तो मुफ्त_हैं, ☺
लेकिन उनकी_कीमत 💵 का अंदाज़ा 😌 तब होता_है, ☝ जब ये कहीं खो_जाते है ।...
#</string>
Note: I have the encoding set as UTF-8 for the XML document.
Upvotes: 1
Views: 1962
Reputation: 2259
Your error sounds like your XML document contains a single-byte control character that's prohibited in XML. XML prohibits certain characters from appearing in a document; see the Char production at https://www.w3.org/TR/xml/#charsets for the list of allowed characters in XML 1.0.
You need to remove these characters before they reach the XML; otherwise your XML will be malformed, and since an XSLT processor has to parse the document before it can transform it, XSLT won't be able to fix the problem for you.
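Since the question is about doing this in Java, here is a minimal sketch of that pre-filtering step. It keeps only code points matched by the XML 1.0 Char production and drops everything else, so emoji and Devanagari text survive while control characters such as NUL or backspace are removed. The class and method names (XmlCharFilter, isXml10Char, stripInvalidXmlChars) are made up for illustration, and it assumes the text has already been decoded into a Java String correctly.

```java
// Hypothetical helper, not part of any library: filters a String down to
// the characters allowed by the XML 1.0 Char production.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow. */
    public static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Iterate by code point so surrogate pairs (emoji) stay intact;
            // an unpaired surrogate falls outside the allowed ranges and is dropped.
            int cp = input.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample text with a NUL and a backspace mixed in with an emoji.
        String raw = "dear friends \uD83D\uDE22\u0000\u0008";
        System.out.println(stripInvalidXmlChars(raw));
    }
}
```

Note that this only removes characters that XML forbids outright. Characters like & and < are legal XML characters and need escaping when the document is written, not removal.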
If you need to transform valid XML characters, XSLT can do that with the translate function. For example, translate(Windows-1252_string, "&#132;&#147;&#148;", "„“”") run on all text nodes should address Windows-1252-encoded quotation marks. Of course, it'd be better to ensure that this input is fixed before it reaches XML.
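One way to fix the input before it reaches XML, sketched in Java: try to decode the raw bytes strictly as UTF-8, and only if that fails, fall back to decoding them as Windows-1252. This assumes the bad input really is Windows-1252, which is a guess based on the curly quotes in the question; the class name EncodingRepair and its methods are hypothetical. It uses the JDK's CharsetDecoder with CodingErrorAction.REPORT so that malformed bytes raise an exception instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: normalizes raw bytes to well-formed UTF-8.
final class EncodingRepair {
    // Assumption: input that is not valid UTF-8 is Windows-1252.
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    private EncodingRepair() {}

    /** Decodes as strict UTF-8; on malformed input, decodes as the fallback charset. */
    public static String decodeLenient(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: reinterpret the same bytes in the fallback charset.
            return new String(bytes, FALLBACK);
        }
    }

    /** Returns the same text re-encoded as UTF-8 bytes. */
    public static byte[] toUtf8(byte[] bytes) {
        return decodeLenient(bytes).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly quotes in Windows-1252 but invalid as UTF-8.
        byte[] legacy = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        System.out.println(decodeLenient(legacy));
    }
}
```

The fallback is a heuristic: bytes that happen to be valid UTF-8 are always treated as UTF-8. If you know the source encoding, decode with that charset directly instead.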
Upvotes: 0