Matt Cashatt

Reputation: 24208

Using C#, how do I close malformed XML tags?

Background

I have inherited a load of XML files that consistently contain an element written with two opening tags rather than an opening tag and a closing tag. I need to loop through all of these files and correct the malformed XML.

Here is a simplified example of the bad XML; the same tag is broken in every file:

<meals>
    <breakfast>
         Eggs and Toast
    </breakfast>
    <lunch>
         Salad and soup
    <lunch>
    <supper>
         Roast beef and potatoes
    </supper>
</meals>

Notice that the <lunch> element has no closing tag: a second <lunch> appears where </lunch> should be. This is consistent in all of the files.

Question

Would it be best to use regex for C# to fix this and, if so, how would I do that exactly?

I already know how to iterate the file system and read the docs into either an XML or string object so you don't need to answer that part.

Thanks!

Upvotes: 3

Views: 2461

Answers (4)

Michael Kay

Reputation: 163312

It's best to avoid thinking of these as XML files: they are non-XML files. This immediately tells you that tools designed for processing XML will be no use, because the input is not XML. You need to use text-based tools. On UNIX this would be things like sed/awk/perl; I've no idea what the equivalent would be on Windows.
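As a quick illustration, and assuming .NET's System.Xml.Linq is available, the following sketch shows an XML parser rejecting such input rather than loading it. The class and method names (WellFormedCheck, IsWellFormed) are made up for this example. The same check could be rerun after a text-based fix to confirm that the file has become real XML.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class WellFormedCheck
{
    // Returns true when the text parses as XML, false when the parser rejects it.
    public static bool IsWellFormed(string text)
    {
        try
        {
            XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for malformed input such as <lunch> ... <lunch>.
            return false;
        }
    }
}
```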

Upvotes: -2

VinayC

Reputation: 49185

If the only issue in your XML files is the one you have shown, then Cheeso's answer should suffice. In fact, I would go that route even if it covered only 80-90% of my cases; the rest I might handle manually or with specific handling code.

That said, if the file structure is complicated and not as simple as you describe, then you should probably look at a text lexer that lets you break the file content into tokens. You still have to do the semantic analysis of the tokens to check and correct irregularities yourself, but at least parsing the text would be much simpler. A few resources on lexing in C#:

  1. http://blogs.msdn.com/b/drew/archive/2009/12/31/a-simple-lexer-in-c-that-uses-regular-expressions.aspx
  2. Poor man's "lexer" for C#
  3. http://www.seclab.tuwien.ac.at/projects/cuplex/lex.htm
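To make the token idea concrete, here is a rough sketch that is not taken from any of the resources above. It treats each tag as a token, tracks open elements on a stack, and rewrites an opening tag as a closing tag when an element with the same name is already open on top of the stack. The names TagLexer and CloseRepeatedOpenTags are hypothetical, and the sketch assumes that an element never legitimately contains a child with its own name and that there are no attributes, comments, or CDATA sections.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class TagLexer
{
    // One token per tag: optional '/', then a name. Attributes, comments and CDATA are not handled.
    private static readonly Regex TagToken = new Regex(@"<(?<slash>/?)(?<name>[A-Za-z][\w.-]*)>");

    // Assumption: "<x> ... <x>" with x still open on top of the stack is a mistyped "</x>".
    public static string CloseRepeatedOpenTags(string text)
    {
        var open = new Stack<string>();
        var result = new StringBuilder();
        int last = 0;

        foreach (Match m in TagToken.Matches(text))
        {
            // Copy the text between the previous tag and this one unchanged.
            result.Append(text, last, m.Index - last);
            last = m.Index + m.Length;

            string name = m.Groups["name"].Value;
            bool isClosing = m.Groups["slash"].Value == "/";

            if (isClosing)
            {
                if (open.Count > 0 && open.Peek() == name) open.Pop();
                result.Append(m.Value);
            }
            else if (open.Count > 0 && open.Peek() == name)
            {
                // Same element opened twice in a row: emit a closing tag instead.
                open.Pop();
                result.Append("</").Append(name).Append('>');
            }
            else
            {
                open.Push(name);
                result.Append(m.Value);
            }
        }

        // Copy whatever follows the last tag.
        result.Append(text, last, text.Length - last);
        return result.ToString();
    }
}
```

Calling TagLexer.CloseRepeatedOpenTags on the file contents returns the repaired text. Anything beyond this simple shape would need a more careful tokenizer.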

Upvotes: 0

Cheeso

Reputation: 192467

If your broken XML is relatively simple, as you've shown in the question, then you can get away with some simplistic logic and a basic regular expression.

    using System;
    using System.Text.RegularExpressions;

    public static class Program
    {
        public static void Main(string[] args)
        {
            string broken = @"
<meals>
    <breakfast>
         Eggs and Toast
    </breakfast>
    <lunch>
         Salad and soup
    <lunch>
    <supper>
         Roast beef and potatoes
    </supper>
</meals>";

            // "open" captures a whole opening tag, "tag" just its name.
            // \k<open> then requires the very same opening tag to appear again.
            var pattern1 = "(?<open><(?<tag>[a-z]+)>)([^<]+?)(\\k<open>)";
            var re1 = new Regex(pattern1, RegexOptions.Singleline);

            string work = broken;
            Match match;
            do
            {
                match = re1.Match(work);
                if (match.Success)
                {
                    Console.WriteLine("Match at position {0}.", match.Index);
                    var tag = match.Groups["tag"].Value;

                    Console.WriteLine("tag: {0}", tag);

                    // Offset in 'work' just after the '<' of the second opening tag.
                    int slashPos = match.Index + match.Value.Length - tag.Length - 1;
                    work = work.Substring(0, slashPos) + "/" + work.Substring(slashPos);

                    Console.WriteLine("fixed: {0}", work);
                }
            } while (match.Success);
        }
    }

That regex uses the "named" capture group feature of .NET regular expressions. The ?<open> indicates that the group captured by the enclosing parens will be accessible by the name "open". That grouping captures the opening tag, including angle brackets. It presumes there is no xml attribute on the opening tag. Within that grouping, there is another named group - this one uses the name "tag" and captures the tag name itself, without angle brackets.

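The regex then lazily captures the intervening text (([^<]+?)), and then another "open" tag, which is specified with a back-reference. Because the character class excludes <, the capture cannot slurp up any intervening tag in the text.

RegexOptions.Singleline makes . match newlines. The pattern above does not use ., and [^<] already matches newlines, so the option is not strictly required here; you would need it if you replaced [^<]+? with .+? for XML that spans multiple lines.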

The logic then applies this regex in a loop, replacing any matched text with a fixed version - valid xml with a closing tag. The fixed XML is produced with simple string slicing.

This regex won't work if:

  • there are XML attributes on the opening tag
  • there is weird spacing - whitespace between the angle brackets enclosing a tag name
  • the tag names use dashes or numbers or anything that is not a lowercase ASCII letter
  • the text between the two tags includes angle brackets (for example in CDATA)

...but the approach will still work. You just would need to tweak things a little.
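As one possible tweak, sketched under assumptions rather than tested against real files, the pattern can be widened to allow attributes on the first tag, mixed-case names, and names containing digits, dashes, dots, or underscores. Regex.Replace then fixes every occurrence in a single pass. The names TagRepair and CloseRepeatedTags are hypothetical; text containing angle brackets (CDATA, comments) is still not handled, and attribute values that contain > will confuse it.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class TagRepair
{
    // head: the first opening tag (optionally with attributes) plus the text after it.
    // tag:  the element name, reused via the \k<tag> back-reference for the second tag.
    private static readonly Regex RepeatedOpen = new Regex(
        @"(?<head><(?<tag>[A-Za-z_][\w.-]*)(?:\s[^<>]*)?>[^<]*)<\k<tag>\s*>");

    // Rewrites "<x ...>text<x>" as "<x ...>text</x>" wherever it occurs.
    public static string CloseRepeatedTags(string text)
    {
        return RepeatedOpen.Replace(text, "${head}</${tag}>");
    }
}
```

For the sample in the question, TagRepair.CloseRepeatedTags(broken) turns the second <lunch> into </lunch> and leaves the rest of the text as it was.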

Upvotes: 3

Ethan Brown

Reputation: 27282

I think regular expressions would be overkill if the situation is truly as simple as you describe it (i.e., it's always the same tag, and there's always only one of them). If your XML files are relatively small (kilobytes, not megabytes), you can just load the whole thing into memory, use string operations to insert the missing slash, and call it a day. This will be considerably more efficient (faster) than trying to use regular expressions. If your files are very large, you can instead read each file line by line until you find the first <lunch> tag, then look for the next one and modify it accordingly. Here's some code to get you started:

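// Requires: using System; using System.IO;
var xml = File.ReadAllText( @"C:\Path\To\NaughtyXml.xml" );

// Find the genuine opening tag, then the second <lunch> that should have been </lunch>.
var firstLunchIdx = xml.IndexOf( "<lunch>", StringComparison.Ordinal );
var secondLunchIdx = firstLunchIdx < 0 ? -1
    : xml.IndexOf( "<lunch>", firstLunchIdx + 1, StringComparison.Ordinal );

// Only insert the slash when both tags were found; otherwise leave the text untouched.
var correctedXml = secondLunchIdx < 0 ? xml
    : xml.Substring( 0, secondLunchIdx + 1 ) + "/" + xml.Substring( secondLunchIdx + 1 );

File.WriteAllText( @"C:\Path\To\CorrectedXml.xml", correctedXml );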
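For the very-large-file case, the line-by-line idea might look roughly like the sketch below. It is only an illustration: the names LunchFixer and FixLines are made up, and it assumes a <lunch> tag is never split across two lines. Lines are processed lazily, so the whole file never has to sit in memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class LunchFixer
{
    // Rewrites the second "<lunch>" occurrence as "</lunch>" and passes everything else through.
    public static IEnumerable<string> FixLines(IEnumerable<string> lines)
    {
        const string tag = "<lunch>";
        int seen = 0;   // how many <lunch> tags have been encountered so far

        foreach (var line in lines)
        {
            var current = line;

            // Stop searching once the second tag has been repaired.
            int idx = seen < 2 ? current.IndexOf(tag, StringComparison.Ordinal) : -1;

            while (idx >= 0)
            {
                seen++;
                if (seen == 2)
                {
                    // Insert the missing slash right after the '<'.
                    current = current.Insert(idx + 1, "/");
                    break;
                }
                idx = current.IndexOf(tag, idx + 1, StringComparison.Ordinal);
            }

            yield return current;
        }
    }
}
```

It could be wired up with File.WriteAllLines( outputPath, LunchFixer.FixLines( File.ReadLines( inputPath ) ) ), where inputPath and outputPath are placeholders for two different files. Note that this writes each line with the platform's default line ending.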

Upvotes: 2
