Reputation: 2293
I have some invalid XML from a vendor that I need to process. Here is an example:
<a>foo</a>
<b>bar</b>
<c>foobar is < $15</c>
So, we have a couple of problems. First, there is no root element. I can work around that by wrapping everything in a root element. No problem. The second, and more difficult, problem is the unescaped less-than sign inside the text of <c>. I could just encode the whole thing, but that would also encode the XML tags. Is there a library or simple method out there for handling this? I really don't want to reinvent the wheel, as I'm sure hundreds of people have dealt with "quasi-XML" like this. Appreciate any help.
Upvotes: 0
Views: 72
Reputation: 280
I would read the file line by line and use a regex to get the value between the opening and closing tags. Your example doesn't have nested elements, so this is pretty easy. While reading line by line, you can encode the inner values. The named capture group (?<xml>.*?) captures everything between the tags into the group named xml.
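using System.Text.RegularExpressions;
var regex = "<.*?>(?<xml>.*?)</.*?>";
// Capture the text between the opening and closing tags on this line.
var badXML = Regex.Match(line, regex, RegexOptions.IgnoreCase).Groups["xml"].Value;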
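To show how the line-by-line idea could fit together end to end, here is a minimal sketch. The class name QuasiXml, the method Parse, and the default root name "root" are made up for illustration. It assumes exactly one flat element per line with no attributes and no nesting, as in the sample, and it assumes the vendor text contains no entities that are already escaped (an existing &amp; would be escaped a second time). Rather than encoding the captured text by hand, it hands the raw value to XElement, which escapes it when the element is serialized.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class QuasiXml
{
    // Assumption: each line looks like <name>text</name>, with no attributes
    // and no nested elements. The backreference \k<name> requires the closing
    // tag to match the opening tag.
    private static readonly Regex LinePattern =
        new Regex(@"^\s*<(?<name>[A-Za-z_][\w.-]*)>(?<xml>.*)</\k<name>>\s*$");

    // Wraps every recognized line in a single root element (hypothetical helper).
    public static XElement Parse(string text, string rootName = "root")
    {
        var root = new XElement(rootName);
        var lines = text.Split(new[] { '\r', '\n' },
                               StringSplitOptions.RemoveEmptyEntries);

        foreach (var line in lines)
        {
            var match = LinePattern.Match(line);
            if (!match.Success)
                throw new FormatException("Unrecognized line: " + line);

            // The raw text is stored as the element value; XElement escapes
            // characters such as '<' and '&' when it writes the XML out.
            root.Add(new XElement(match.Groups["name"].Value,
                                  match.Groups["xml"].Value));
        }

        return root;
    }
}
```

Calling QuasiXml.Parse on the three sample lines should give a root element whose c child has the value "foobar is < $15", and calling ToString() on the result writes that less-than sign as &lt;. Lines that don't fit the pattern throw instead of being silently dropped, which seems safer for vendor data, but that choice is only a suggestion.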
Upvotes: 1