Parsing XML-ish data

Question

Yes, I really am going to ask about parsing XML with regexes... here goes.

I have some XML-ish data, and I need to parse it. I can't do it completely with an XmlDocument or similar because it's not proper XML, and I'm not sure I can (or want to) change the format. The main problem is tags which have special meaning and look like this:

<$ something_here $>

C#'s XmlDocument falls over parsing that, and I assume other methods will too. I could, with a lot of work, change the above to something like:

<some_special_tag><![CDATA[ something_here ]]></some_special_tag>

But that's ugly, and I don't really want to. The reason it would be time consuming to change is that I have hundreds, maybe thousands of XML documents which would need to be changed.

At the moment, I'm parsing the document with regexes. I only need to pick out a couple of specific tags (not the ones above), and it seems to be working, but I'm uncomfortable with it. I'm doing something like this at the moment:

...

MatchCollection mc = Regex.Matches(Template, ""); // or similar
foreach (Match m in mc) {
    try {
        XmlDocument xd = new XmlDocument();
        xd.LoadXml(m.Value);

...

This at least means I'm not using regexes exclusively :)

Can anyone think of a better way? Is there some way of getting XmlDocument to politely ignore the $ character that causes it to fall over? It doesn't seem likely, but I thought I should at least get some opinions.

Justin · Accepted Answer

No, there is no way to get XmlDocument to parse a document which isn't XML, no matter how close to XML it might look!

If it's possible, I would definitely recommend that you convert your documents to actual XML (or at least some recognised document format). Creating and maintaining a reliable working parser for any format is quite a lot of work, let alone for a format that doesn't appear to be rigorously defined.

Using a <some_special_tag> element to identify special sections seems like a good idea to me. If necessary, you can use a different namespace to ensure no clashes with other elements in your document. This is in fact exactly the way that XSLT works ("special" tags are used to mean special things, like templates or nodes that should be replaced) and exactly what XML was designed to support.
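To illustrate the namespace idea, here is a minimal sketch of how the special sections could be picked out once they are real elements. The namespace URI urn:example:special, the sp prefix, the element name special, and the SpecialTagReader class are all made up for this example; substitute whatever suits your documents.

```csharp
using System;
using System.Xml;

public static class SpecialTagReader
{
    // Hypothetical namespace URI, chosen purely for illustration.
    public const string SpecialNs = "urn:example:special";

    // Returns the text content of every <sp:special> element in the document.
    public static string[] ReadSpecialSections(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // XPath needs a prefix-to-URI mapping to match namespaced elements.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("sp", SpecialNs);

        XmlNodeList nodes = doc.SelectNodes("//sp:special", ns);
        var result = new string[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
        {
            result[i] = nodes[i].InnerText;
        }
        return result;
    }
}
```

Because the special element lives in its own namespace, an ordinary element that happens to be called special elsewhere in the document would not be matched by the XPath query.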

Also, I don't understand why you would need to place the something_here bit in a CDATA section. All characters that "break" XML can be escaped fairly easily (for example, by writing < as &lt;). CDATA sections are generally only used when the contents of a node need so much escaping that it's easier and less messy to just use a CDATA section instead.
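As a rough sketch of the escaping point: if you build the element through XmlDocument rather than by string concatenation, the escaping is done for you and no CDATA section is needed. The SpecialTagWriter class and its Wrap method are names invented for this example.

```csharp
using System;
using System.Xml;

public static class SpecialTagWriter
{
    // Wraps arbitrary text in a <some_special_tag> element.
    // Setting InnerText makes XmlDocument escape characters such as < and &.
    public static string Wrap(string content)
    {
        var doc = new XmlDocument();
        XmlElement element = doc.CreateElement("some_special_tag");
        element.InnerText = content;
        return element.OuterXml;
    }

    // Parses the wrapped element again and returns the original text.
    public static string Unwrap(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }
}
```

Calling Wrap on a string like "a < b & c" yields markup containing &lt; and &amp;, and Unwrap gives back the original string unchanged, so the special content survives the round trip.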

Update: Regarding migration to a new format, can you not use both methods? Attempt to parse the document as an XML document (or if there are performance concerns then perform some other test to quickly determine if the document is in the "old" or "new" format such as checking for a version attribute in the root element) - if it doesn't work then fall back to the old method.

This way, as long as everything is working fine (which it will be as long as nothing changes), users don't need to modify their documents. However, if they run into problems or want to use any new features, explain to them that they must update their documents to the new format.
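The "try XML first, fall back otherwise" approach could look roughly like the sketch below. TemplateLoader and TryLoadAsXml are hypothetical names, and the legacy regex-based path is whatever you already have; it is not shown here.

```csharp
using System;
using System.Xml;

public static class TemplateLoader
{
    // Returns true and the parsed document when the text is well-formed XML.
    // Returns false when the caller should fall back to the legacy parser.
    public static bool TryLoadAsXml(string template, out XmlDocument doc)
    {
        doc = new XmlDocument();
        try
        {
            doc.LoadXml(template);
            return true;
        }
        catch (XmlException)
        {
            // Not well-formed XML (for example, it still contains <$ ... $>).
            doc = null;
            return false;
        }
    }
}
```

A caller would use the returned XmlDocument when the method succeeds and hand the raw text to the existing regex code when it does not. If exceptions turn out to be too slow for the old-format case, a cheaper check (such as looking for a version attribute on the root element) could be done first.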

Depending on how well your current "parser" works, you may even be able to provide an upgrade utility that automatically performs the conversion (as best it can).
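A very rough sketch of such an upgrade utility is below. It assumes the special tags never nest, that a literal $> never appears inside one, and that leading and trailing whitespace inside the tag is not significant; none of that is confirmed by the question, so treat it as a starting point only. TemplateUpgrader is a made-up name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemplateUpgrader
{
    // Matches <$ ... $> non-greedily; Singleline lets the body span lines.
    private static readonly Regex LegacyTag =
        new Regex(@"<\$(.*?)\$>", RegexOptions.Singleline);

    // Rewrites every <$ ... $> as an escaped <some_special_tag> element.
    public static string Upgrade(string legacy)
    {
        return LegacyTag.Replace(legacy, m =>
        {
            // Assumption: surrounding whitespace is not meaningful.
            string body = m.Groups[1].Value.Trim();

            // Escape & first so the entities added next are not re-escaped.
            string escaped = body.Replace("&", "&amp;")
                                 .Replace("<", "&lt;")
                                 .Replace(">", "&gt;");

            return "<some_special_tag>" + escaped + "</some_special_tag>";
        });
    }
}
```

Running the converted output through XmlDocument.LoadXml afterwards is a cheap way to flag the documents that still need fixing by hand.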
