Reputation: 59
I have an XML file with invalid characters. I searched the internet and haven't found any way other than reading the file as text and replacing the invalid characters one by one.
Can somebody please tell me the easiest way to remove invalid characters from an XML file?
Example XML stream:
<Year>where 12 > 13 occures </Year>
Upvotes: 1
Views: 1904
Reputation: 116178
I would try HtmlAgilityPack. At least it is better than trying to parse manually.
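using System;
using System.IO;
using System.Xml;

// Load the malformed markup with HtmlAgilityPack's lenient parser.
HtmlAgilityPack.HtmlDocument hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml("<Year>where 12 > 13 occures </Year>");
using (StringWriter wr = new StringWriter())
{
    using (XmlWriter xmlWriter = XmlWriter.Create(wr,
        new XmlWriterSettings() { OmitXmlDeclaration = true }))
    {
        // Re-serialize the parsed document through an XmlWriter.
        hdoc.Save(xmlWriter);
    }
    // Read the result only after the XmlWriter has been disposed (and flushed).
    Console.WriteLine(wr.ToString());
}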
This outputs:
<year>where 12 > 13 occures </year>
Upvotes: 3
Reputation: 163625
Start by thinking of the question differently. Your problem is that the input isn't valid XML. So you actually want to remove invalid characters from a non-XML file. That might sound pedantic, but it immediately indicates that tools designed for processing XML will be no use to you, because your input is not XML.
Fixing the problem at source is always better than trying to repair the damage later. But if you are going to embark on a repair strategy, the first thing is to define precisely what faults in the data you want to repair and how you intend to repair them. It's also a good idea to say clearly what constraints you apply to the solution: for example, does it matter if your repair accidentally changes the contents of any comments or CDATA sections?
Once you have defined your repair strategy, e.g. "replace any & by &amp; if it is not immediately followed by either #nn; or #xnn; or a name followed by ';'", coding it up becomes quite straightforward.
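As a rough illustration of that example strategy, here is a minimal sketch using a regular expression with a negative lookahead. The class and method names (AmpersandRepair, EscapeBareAmpersands) are made up for this example, and the name pattern is simplified to ASCII characters, which is narrower than what XML actually allows. It also makes no attempt to skip comments or CDATA sections, so it only fits if changing their contents is acceptable.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class AmpersandRepair
{
    // Matches an '&' that is NOT immediately followed by:
    //   #nn;   (decimal character reference)
    //   #xnn;  (hexadecimal character reference)
    //   name;  (named entity reference; ASCII-only simplification)
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!#[0-9]+;|#x[0-9A-Fa-f]+;|[A-Za-z_:][A-Za-z0-9_.:\-]*;)",
        RegexOptions.Compiled);

    // Replaces every bare '&' with "&amp;" and leaves existing references alone.
    public static string EscapeBareAmpersands(string text)
    {
        return BareAmpersand.Replace(text, "&amp;");
    }
}
```

Calling AmpersandRepair.EscapeBareAmpersands on the raw text before handing it to an XML parser would turn "Tom & Jerry" into "Tom &amp; Jerry" while leaving "&amp;", "&#38;" and "&#x26;" untouched. Other faults, such as a stray "<" in text content, would need their own precisely defined rule.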
Upvotes: 0