Geesh_SO

Reputation: 2206

I'm having trouble with a multiline regex in C#. How do I fix this?

I have the following code to attempt to extract the content of li tags.

        string blah = @"<ul>
        <li>foo</li>
        <li>bar</li>
        <li>oof</li>
        </ul>";

        string liRegexString = @"(?:.)*?<li>(.*?)<\/li>(?:.?)*";
        Regex liRegex = new Regex(liRegexString, RegexOptions.Multiline);
        Match liMatches = liRegex.Match(blah);
        if (liMatches.Success)
        {
            foreach (var group in liMatches.Groups)
            {
                Console.WriteLine(group);
            }
        }
        Console.ReadLine();

The Regex started much simpler and without the multiline option, but I've been tweaking it to try to make it work.

I want results foo, bar and oof but instead I get <li>foo</li> and foo.

On top of this, it seems to work fine in Regex101: https://regex101.com/r/jY6rnz/1

Any thoughts?

Upvotes: 1

Views: 82

Answers (2)

Chris

Reputation: 27609

I will start by saying that, as mentioned in the comments, you should be parsing HTML with a proper HTML parser such as the HtmlAgilityPack. Moving on to actually answer your question, though...

The problem is that liRegex.Match(blah) only ever returns the first match. What you want is liRegex.Matches(blah), which returns all matches.

So your use would be:

var liMatches = liRegex.Matches(blah);
foreach(Match match in liMatches)
{
    Console.WriteLine(match.Groups[1].Value);
}

Upvotes: 3

Sweeper

Reputation: 271565

Your regex produces multiple matches when matched against blah. The method Match only returns the first match, which is the foo one. You are printing all groups in that first match, which gives you (1) the whole match and (2) group 1 of that match.
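To make the point above concrete, here is a minimal sketch of what Match hands back. The GroupsDemo class, the FirstMatchGroups helper and the shortened pattern `<li>(.*?)</li>` are my own illustrative assumptions, not part of your code; the behaviour of Groups is the same with your original pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupsDemo
{
    // Hypothetical helper: collects every group value of the FIRST match only.
    public static List<string> FirstMatchGroups(string input)
    {
        var values = new List<string>();
        // Simplified pattern (an assumption for illustration).
        Match first = Regex.Match(input, @"<li>(.*?)</li>");
        if (first.Success)
        {
            // Groups[0] is the whole match, Groups[1] is the first capture group.
            foreach (Group group in first.Groups)
            {
                values.Add(group.Value);
            }
        }
        return values;
    }

    public static void Main()
    {
        foreach (string value in FirstMatchGroups("<li>foo</li><li>bar</li>"))
        {
            Console.WriteLine(value);
        }
        // prints "<li>foo</li>" then "foo"; "bar" never shows up
    }
}
```

So looping over Groups walks the parts of one match, not the successive matches in the input.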

If you want to get foo, bar and oof, then you should print group 1 of each match. To do this, first get all the matches using Matches, then iterate over the MatchCollection and print Groups[1]:

string blah = @"<ul>
<li>foo</li>
<li>bar</li>
<li>oof</li>
</ul>";
string liRegexString = @"(?:.)*?<li>(.*?)<\/li>(?:.?)*";
Regex liRegex = new Regex(liRegexString, RegexOptions.Multiline);
MatchCollection liMatches = liRegex.Matches(blah);
// Cast<Match>() requires "using System.Linq;" at the top of the file
foreach (var match in liMatches.Cast<Match>())
{
    Console.WriteLine(match.Groups[1]);
}
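As a side note, and not something the fix depends on: RegexOptions.Multiline only changes how ^ and $ behave, and your pattern uses neither, so it has no effect here. The leading and trailing non-capturing groups also do not appear to be needed for this input. A trimmed-down sketch, assuming each li sits on a single line (the LiExtractor class and ExtractListItems helper are hypothetical names of mine):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LiExtractor
{
    // Hypothetical helper: returns the inner text of each <li>...</li> pair.
    public static List<string> ExtractListItems(string html)
    {
        var items = new List<string>();
        // No RegexOptions.Multiline: the pattern has no ^ or $ anchors.
        foreach (Match match in Regex.Matches(html, @"<li>(.*?)</li>"))
        {
            items.Add(match.Groups[1].Value);
        }
        return items;
    }

    public static void Main()
    {
        string blah = @"<ul>
<li>foo</li>
<li>bar</li>
<li>oof</li>
</ul>";
        foreach (string item in ExtractListItems(blah))
        {
            Console.WriteLine(item);
        }
        // prints foo, bar and oof on separate lines
    }
}
```

If an li could span several lines, the dot would stop at the newline; RegexOptions.Singleline is the option that lets . match \n. At that point a real HTML parser is probably the safer choice.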

Upvotes: 2
