Reputation: 45771
I have an XML structure like this, some Student item contains invalid UTF-8 byte sequenceswhich may cause XML parsing fail for the whole XML document.
What I want to do is, filter out Student item which contains UTF-8 byte sequences, and keep the valid byte sequences ones. Any advice or samples about how to do this in .Net (C# preferred)?
BTW: invalid byte sequences I mean => http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
<?xml version="1.0" encoding="utf-8"?>
<AllStudents>
<Student>
Mike
</Student>
<Student>
(Invalid name here)
</Student>
</AllStudents>
thanks in advance, George
Upvotes: 0
Views: 5171
Reputation: 23548
I don't know C#, so I'm afraid I can't give you code to do this, but the basic idea is to read the whole file as a utf-8 text file, using a DecoderFallback to replace invalid sequences with either question mark characters or the unicode chacter 0xFFFD. Then write the file back out as a utf-8 text file, and parse that.
Basically, you separate out the operation of "wiping out bad utf-8 sequences" from the operation of "parsing the xml file".
You should probably even be able to skip writing the file back out again before running the XML parser to read in the fixed data; there should be some way to write the file to an in-memory byte stream and parse that byte stream as XML. (Again, sorry for not knowing C#)
Upvotes: 2
Reputation: 958
That's pretty hard to do. You won't get an XML parser to parse a document with invalid characters in it, so I think you're reduced to a couple of options:
Upvotes: 2