George2

Reputation: 45771

XML UTF-8 encoding checking

I have an XML structure like the one below. Some Student items contain invalid UTF-8 byte sequences, which can cause parsing to fail for the whole XML document.

What I want to do is filter out the Student items that contain invalid UTF-8 byte sequences and keep the ones whose byte sequences are valid. Any advice or samples on how to do this in .NET (C# preferred)?

BTW: by "invalid byte sequences" I mean those described at http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

<?xml version="1.0" encoding="utf-8"?>
<AllStudents>
  <Student>
    Mike
  </Student>
  <Student>
    (Invalid name here)
  </Student>  
</AllStudents>

Thanks in advance, George

Upvotes: 0

Views: 5171

Answers (3)

Daniel Martin

Reputation: 23548

I don't know C#, so I'm afraid I can't give you code to do this, but the basic idea is to read the whole file as a UTF-8 text file, using a DecoderFallback to replace invalid sequences with either question marks or the Unicode replacement character U+FFFD. Then write the file back out as a UTF-8 text file, and parse that.

Basically, you separate out the operation of "wiping out bad utf-8 sequences" from the operation of "parsing the xml file".

You should probably even be able to skip writing the file back out again before running the XML parser to read in the fixed data; there should be some way to write the file to an in-memory byte stream and parse that byte stream as XML. (Again, sorry for not knowing C#)
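A minimal C# sketch of this idea, under some assumptions: the class and method names (`StudentFilter`, `LoadAndFilter`) are made up for illustration, the invalid bytes are assumed to sit inside element text rather than inside markup, and any Student whose text contains U+FFFD after decoding is treated as invalid (which would also drop a Student that legitimately contained that character). It decodes in memory, so nothing is written back to disk.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class StudentFilter
{
    // Marker substituted for every invalid UTF-8 byte sequence.
    private const char Replacement = '\uFFFD';

    // Hypothetical helper: decode leniently, parse, drop damaged Student items.
    public static XDocument LoadAndFilter(byte[] rawBytes)
    {
        // UTF-8 decoder that replaces invalid sequences instead of throwing.
        Encoding lenientUtf8 = Encoding.GetEncoding(
            "utf-8",
            new EncoderExceptionFallback(),
            new DecoderReplacementFallback(Replacement.ToString()));

        string text = lenientUtf8.GetString(rawBytes);

        // GetString keeps a leading byte order mark; strip it before parsing.
        if (text.Length > 0 && text[0] == '\uFEFF')
        {
            text = text.Substring(1);
        }

        // Parsing from a string, so the encoding declaration is not re-applied.
        XDocument doc = XDocument.Parse(text);

        // Remove every Student whose text content picked up the marker.
        doc.Descendants("Student")
           .Where(s => s.Value.IndexOf(Replacement) >= 0)
           .Remove();

        return doc;
    }
}
```

It could be called as `StudentFilter.LoadAndFilter(File.ReadAllBytes("students.xml"))`, where the file name is only a placeholder. If a bad sequence happens to damage a tag rather than text, `XDocument.Parse` will still throw, so this is not a general repair tool.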

Upvotes: 2

John Snelson

Reputation: 958

That's pretty hard to do. You won't get an XML parser to parse a document with invalid characters in it, so I think you're reduced to a couple of options:

  1. Figure out why the encoding is wrong - a common problem is labeling the document as UTF-8 (or having no encoding declaration) when the document is actually written in Latin-1.
  2. Take out the bad sections by hand.
  3. Try and find a tag soup parser for .NET that will continue parsing after the error.
  4. Reject the invalid XML document.
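For option 1, one way to test the "actually Latin-1" theory is to decode strictly as UTF-8 and fall back to Latin-1 only when that fails. This is a rough sketch, not a detection algorithm: the names `EncodingGuess` and `DecodeUtf8OrLatin1` are invented here, and Latin-1 will accept any byte sequence, so a successful fallback does not prove the guess was right.

```csharp
using System;
using System.Text;

public static class EncodingGuess
{
    // Hypothetical helper: strict UTF-8 first, Latin-1 (ISO-8859-1) as fallback.
    public static string DecodeUtf8OrLatin1(byte[] bytes)
    {
        // Second argument true makes the decoder throw on invalid bytes.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8; reinterpret every byte as one Latin-1 character.
            return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        }
    }
}
```

The resulting string can then be handed to `XDocument.Parse`. If the fallback path was taken, it is worth checking the decoded names by eye before trusting them.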

Upvotes: 2

bortzmeyer

Reputation: 35479

Very close to the "XML encoding issue" question.

Upvotes: 1
