user2847238

Reputation: 169

XML Regex Extraction

I have an XML file and I need to extract data from it. This task would be trivial if I could use XDocument, but the whole point of the exercise is to write my own parser using regex. The XML looks similar to the following:

<A>
    <B>
        <C>ASD</C>
    </B>
    <B>
        <C>ZXC</C>
    </B>
</A>

I came up with the idea of splitting the input into an opening tag, its content, and a closing tag.

        string acquiredFile = myStringBuilder.ToString();
        string regexPattern = "(?<open><[A-z0-9]{1,}>)(?<content>.*)(?<close></[A-z0-9]{1,}>)";
        Regex rx = new Regex(regexPattern, RegexOptions.Singleline);

        foreach (Match match in rx.Matches(acquiredFile))
        {
            Console.WriteLine(match.Groups["open"].Value);
            Console.WriteLine(match.Groups["content"].Value);
            Console.WriteLine(match.Groups["close"].Value);
        }

I need to wrap this in a loop. The extraction above only works when each element contains a single nested element, as in this document:

<A>
    <B>
        <C>ASD</C>
    </B>
</A>

Could you please help me extend this code so that it works with multiple nested elements?

Upvotes: 0

Views: 929

Answers (1)

HugoRune

Reputation: 13799

You can deal with nested elements by recursion:

Wrap the code you already have in a function that calls itself on the content of every element it finds:

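// yourRegexp stands for the pattern from the question (or an improved one).
static void Parse(string xml)
{
    var matches = Regex.Matches(xml, yourRegexp, RegexOptions.Singleline);
    if (matches.Count == 0)
    {
        // No elements left in this fragment: treat it as plain text content.
        Console.WriteLine("CONTENT:" + xml);
    }
    foreach (Match match in matches)
    {
        Console.WriteLine("OPEN:" + match.Groups["open"].Value);
        // Recurse into whatever sits between the opening and closing tag.
        Parse(match.Groups["content"].Value);
        Console.WriteLine("CLOSE:" + match.Groups["close"].Value);
    }
}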

However, let me discourage you a bit first:

The above approach will not work with your regex (?<open><[A-z0-9]{1,}>)(?<content>.*)(?<close></[A-z0-9]{1,}>).
The first problem, as you mentioned, is multiple consecutive <B>...</B><B>...</B> elements. Because .* is greedy, your regex will capture everything from the first <B> to the last </B> as a single match.

Now, a simple fix for this problem would be the regex <(?<open>[A-z0-9]{1,})>(?<content>.*?)</\1>, which non-greedily matches everything between an opening <TAGNAME> and the next closing </TAGNAME> with the same name. The backreference \1 repeats whatever the open group captured.
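To make this concrete, here is a minimal sketch that combines the recursive function with a backreference-based pattern and runs it on the sample from the question. The class name RegexXmlSketch, the depth parameter and the output list are my own additions for illustration. I also wrote the character class as [A-Za-z0-9]+, because [A-z] additionally matches the punctuation characters that sit between Z and a, and used the named backreference \k<open> instead of \1.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexXmlSketch
{
    // The tag name is captured once and must reappear in the closing tag.
    static readonly Regex Element = new Regex(
        @"<(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>",
        RegexOptions.Singleline);

    // Appends one line per event to output, indented by nesting depth.
    public static void Parse(string xml, int depth, List<string> output)
    {
        MatchCollection matches = Element.Matches(xml);
        string indent = new string(' ', depth * 2);

        if (matches.Count == 0)
        {
            // No elements in this fragment: report non-blank text as content.
            string text = xml.Trim();
            if (text.Length > 0) output.Add(indent + "CONTENT:" + text);
            return;
        }

        foreach (Match match in matches)
        {
            string name = match.Groups["open"].Value;
            output.Add(indent + "OPEN:" + name);
            Parse(match.Groups["content"].Value, depth + 1, output);
            output.Add(indent + "CLOSE:" + name);
        }
    }

    public static void Main()
    {
        string xml = "<A>\n  <B>\n    <C>ASD</C>\n  </B>\n  <B>\n    <C>ZXC</C>\n  </B>\n</A>";
        var lines = new List<string>();
        Parse(xml, 0, lines);
        foreach (string line in lines) Console.WriteLine(line);
    }
}
```

For the sample document this reports both <B> elements separately, each with its own <C> child. Note that the sketch silently drops any text that sits next to child elements, and it does not handle attributes or self-closing tags.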

Looks good? Well, it is not, because this regexp will fail for nested elements with the same name, like <B><C><B></B></C></B>.
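A small sketch of that failure, assuming the same backreference pattern as above (the class and method names are made up for the example):

```csharp
using System;
using System.Text.RegularExpressions;

static class SameNameNestingDemo
{
    // Returns the content group of the first element match, or null if there is none.
    public static string FirstContent(string xml)
    {
        Match m = Regex.Match(
            xml,
            @"<(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>",
            RegexOptions.Singleline);
        return m.Success ? m.Groups["content"].Value : null;
    }

    public static void Main()
    {
        // The outer <B> gets paired with the inner </B>.
        Console.WriteLine(FirstContent("<B><C><B></B></C></B>")); // prints <C><B>
    }
}
```

The lazy quantifier stops at the first </B> it sees, which belongs to the inner element, so the captured content is the unbalanced fragment <C><B>.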

You will continue to run into these problems. As you come up with more and more complicated regexes, there will always be some counterexample that breaks them.

This is because regex is the wrong tool for this sort of task. Arbitrarily nested tags form a Chomsky type 2 (context-free) grammar, and you are trying to capture it with a Chomsky type 3 (regular) one. (Also see this humorous take on the subject).
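What the regex is missing is memory of how deep it currently is. As a rough illustration only, and not a real XML parser, the sketch below uses a regex just to find individual tags and keeps the nesting in an explicit stack. The names StackScanSketch and IsBalanced are hypothetical, and attributes, comments and self-closing tags are deliberately ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class StackScanSketch
{
    // Matches one opening or closing tag; nothing else is recognised.
    static readonly Regex Tag = new Regex(@"<(?<slash>/?)(?<name>[A-Za-z0-9]+)>");

    // True when every opening tag is closed by a tag of the same name, in the right order.
    public static bool IsBalanced(string xml)
    {
        var open = new Stack<string>();
        foreach (Match tag in Tag.Matches(xml))
        {
            string name = tag.Groups["name"].Value;
            if (tag.Groups["slash"].Value.Length == 0)
            {
                // Opening tag: remember it until its closing tag shows up.
                open.Push(name);
            }
            else if (open.Count == 0 || open.Pop() != name)
            {
                // Closing tag without a matching innermost opening tag.
                return false;
            }
        }
        return open.Count == 0;
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("<B><C><B></B></C></B>")); // True
        Console.WriteLine(IsBalanced("<B><C></B></C>"));        // False
    }
}
```

The stack is what lets the inner </B> be paired with the inner <B>, which is exactly the case the single-pattern approach gets wrong.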

In the end, writing a proper XML parser is far from simple, which is why the usual recommendation is to use one of the standard ones.

Upvotes: 2
