SoftwareGeek

Reputation: 15782

XML - removing illegal chars from an xml document

I have an XML document that contains special characters such as '%', carriage return, line feed, &, <, >, ', and ". I tried encoding the entire XML document, but the result fails to load with the XmlDocument.Load method in C#.

What is the best way to deal with these special characters without hardcoding a replacement of each illegal character with its corresponding entity reference?

Upvotes: 0

Views: 246

Answers (2)

Michael Kay

Reputation: 163587

Where does the not-quite-XML document come from? Your focus should be on correcting the source of the document so that it produces proper XML. All the benefits of using XML are lost if people start sending stuff that is almost XML but not quite - you might as well use a completely proprietary format.
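As an illustration of fixing the source, here is a minimal sketch that assumes the producer is .NET code that currently builds the document by string concatenation. The class name `NoteProducer` and the `note` element are hypothetical and not from the question. Writing through `XmlWriter` lets the framework escape text content, so nothing has to be replaced by hand.

```csharp
using System.IO;
using System.Xml;

// Hypothetical producer, for illustration only.
public static class NoteProducer
{
    // Builds <note>...</note> with the body escaped by XmlWriter.
    public static string BuildNote(string body)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var buffer = new StringWriter())
        {
            using (var writer = XmlWriter.Create(buffer, settings))
            {
                writer.WriteStartElement("note");
                // WriteString escapes markup characters such as & and <.
                writer.WriteString(body);
                writer.WriteEndElement();
            }
            // The writer is disposed (and flushed) before the buffer is read.
            return buffer.ToString();
        }
    }
}
```

A document produced this way should load with `XmlDocument.LoadXml` without further processing, and the element's `InnerText` should give back the original body.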

Upvotes: 5

Chris Heald

Reputation: 62668

The short answer is that an XML-like document containing invalid characters isn't a well-formed XML document, and it needs to be fixed before an XML parser will accept it.

You have two possible fixes. The first, which you've already hinted at, is to replace the invalid characters with entity references. The second is to wrap any content containing those characters in CDATA sections; characters such as & and < need no escaping there, as long as the content does not itself contain the sequence ]]>.
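Both fixes can be sketched in C# roughly as follows. This is a minimal sketch that assumes you can get at the raw text of each element before the document is assembled. The class name `XmlTextFixes` and the `note` element are hypothetical. `SecurityElement.Escape` is a framework method that replaces <, >, &, ' and " with their entity references, so no replacement table has to be hardcoded. Note that '%' and line breaks are legal in element content and need no escaping at all.

```csharp
using System.Security;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class XmlTextFixes
{
    // Fix 1: replace <, >, &, ' and " with entity references.
    public static string EscapeAsEntities(string rawText)
    {
        return SecurityElement.Escape(rawText);
    }

    // Fix 2: wrap the text in a CDATA section. A literal "]]>" would
    // end the section early, so it is split across two sections.
    public static string WrapInCData(string rawText)
    {
        string safe = rawText.Replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + safe + "]]>";
    }

    // Builds a one-element document from the fixed text and parses it.
    public static XmlDocument LoadNote(string rawText, bool useCData)
    {
        string content = useCData
            ? WrapInCData(rawText)
            : EscapeAsEntities(rawText);
        var doc = new XmlDocument();
        doc.LoadXml("<note>" + content + "</note>");
        return doc;
    }
}
```

With either fix, `LoadNote("a < b & c", useCData).DocumentElement.InnerText` gives back the original text. Neither approach helps with characters that XML 1.0 forbids outright, such as most control characters; those have to be removed or encoded some other way.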

If neither of those is an option, you're going to need to figure out how to parse the document with a parser that doesn't care about invalid characters, which is probably a bad idea, and should be avoided if at all possible.

Upvotes: 5
