Reputation: 1811

Need some C# Regular Expression Help

I'm trying to come up with a regular expression that will stop at the first occurrence of </ol>. My current regex sort of works, but only if </ol> has spaces on either end. For instance, instead of stopping at the first instance in the line below, it stops at the second:

some random text <a href = "asdf">and HTML</a></ol></b> bla </ol>

Here's the pattern I'm currently using: string pattern = @"some random text(.|\r|\n)*</ol>";

What am I doing wrong?

Upvotes: 0

Answers (5)

stema

Reputation: 92986

Others have already explained the missing ? that makes the quantifier non-greedy. I also want to suggest another change.

I don't like your (.|\r|\n) part. If an alternation contains only single characters, it's simpler to use a character class. Be careful, though: inside a class the . is a literal dot, so [.\r\n] would not do the same thing; an equivalent class would be [\s\S]. A class is easier to read (I don't know about the compiler side, but it may also be more efficient).

But in your particular case, where the only alternatives to the . are newline characters, even a character class is not the best approach. Here you should do this:

Regex A = new Regex(@"some random text.*?</ol>", RegexOptions.Singleline);

Use the Singleline modifier. It simply makes the . match newline characters as well.
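
To see the effect of the option, here is a minimal sketch that runs the same lazy pattern with and without RegexOptions.Singleline. The class name, the FirstBlock helper, and the multi-line input are made up for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: returns the text up to and including the first </ol>,
    // or an empty string when the pattern does not match at all.
    public static string FirstBlock(string input, RegexOptions options)
    {
        Match m = Regex.Match(input, @"some random text.*?</ol>", options);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // Assumed sample input: the first </ol> sits on a later line.
        string input = "some random text\r\n<li>item</li>\r\n</ol> bla </ol>";

        // Without Singleline, '.' does not match \n, so the pattern cannot
        // reach any </ol> and there is no match.
        string without = FirstBlock(input, RegexOptions.None);
        Console.WriteLine(without.Length == 0);

        // With Singleline, '.' also matches \n, and the lazy quantifier
        // stops at the first </ol>.
        string with = FirstBlock(input, RegexOptions.Singleline);
        Console.WriteLine(with.EndsWith("</li>\r\n</ol>", StringComparison.Ordinal));
    }
}
```

Both lines should print True for this input. Note that Singleline only changes what . matches; it has nothing to do with RegexOptions.Multiline, which changes ^ and $.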

Upvotes: 0

ridgerunner

Reputation: 34395

This regex matches everything from the beginning of the string up to (but not including) the first </ol>. It uses Friedl's "unrolling-the-loop" technique, so it is quite efficient:

Regex pattern = new Regex(
    @"^[^<]*(?:(?!</ol\b)<[^<]*)*(?=</ol\b)",
    RegexOptions.IgnoreCase);
resultString = pattern.Match(text).Value;
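
As a rough check, the pattern can be run against the sample line from the question. This is only a sketch: the class name and the TextBeforeFirstOl wrapper are invented for the example, and returning an empty string when nothing matches is an assumption, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnrolledLoopDemo
{
    // The pattern from the answer: consume non-'<' runs, and consume a '<'
    // only when it does not start </ol. The lookahead requires </ol next.
    private static readonly Regex BeforeFirstOl = new Regex(
        @"^[^<]*(?:(?!</ol\b)<[^<]*)*(?=</ol\b)",
        RegexOptions.IgnoreCase);

    // Hypothetical wrapper: empty string when the input contains no </ol>.
    public static string TextBeforeFirstOl(string text)
    {
        Match m = BeforeFirstOl.Match(text);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string text = "some random text <a href = \"asdf\">and HTML</a></ol></b> bla </ol>";

        // The match ends just before the first </ol>; the tag itself is
        // only asserted by the lookahead, not consumed.
        Console.WriteLine(TextBeforeFirstOl(text));
    }
}
```

Because of the ^ anchor the match always starts at the beginning of the input, so no Singleline option is needed: [^<] already matches line breaks.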

Upvotes: 0

Tim

Reputation: 28530

It's not a regex, but why not simply use Substring together with IndexOf, like this:

string returnString = someRandomText.Substring(0, someRandomText.IndexOf("</ol>"));

That would seem to be a lot easier than coming up with a Regex to cover all the possible varieties of characters, spaces, etc.
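
One thing to watch with this approach: IndexOf returns -1 when the tag is absent, and Substring throws on a negative length. A slightly more defensive sketch is below; the class name, the TextBefore helper, the case-insensitive comparison, and the choice to return the whole input when the marker is missing are all assumptions added for illustration.

```csharp
using System;

public static class SubstringDemo
{
    // Hypothetical helper: returns the text before the first occurrence of
    // marker, or the whole input when marker does not occur.
    public static string TextBefore(string input, string marker)
    {
        // OrdinalIgnoreCase also finds </OL>; IndexOf returns -1 if absent.
        int index = input.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        return index >= 0 ? input.Substring(0, index) : input;
    }

    public static void Main()
    {
        string text = "some random text <a href = \"asdf\">and HTML</a></ol></b> bla </ol>";

        // Everything before the first </ol>.
        Console.WriteLine(TextBefore(text, "</ol>"));

        // No marker present: the input comes back unchanged.
        Console.WriteLine(TextBefore("no list here", "</ol>"));
    }
}
```

To keep the closing tag in the result, add marker.Length to index before calling Substring.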

Upvotes: 1

Brad Christie

Reputation: 101614

Make your wildcard "ungreedy" (lazy) by adding a ?, e.g.:

some random text(.|\r|\n)*?</ol>
                          ^- Addition

This will make the regex match as few characters as possible, instead of as many as possible (the default behavior).

Oh, and regex shouldn't parse [X]HTML.

Upvotes: 2

Mike Caron

Reputation: 14561

string pattern = @"some random text(.|\r|\n)*?</ol>";

Note the question mark after the star: it makes the quantifier non-greedy, which means it will match as little as possible rather than as much as possible (the greedy default).
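
The difference is easy to see by running both patterns over the sample line from the question. This is a small sketch; the class name and the MatchOf helper are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: the matched text, or an empty string on no match.
    public static string MatchOf(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string input = "some random text <a href = \"asdf\">and HTML</a></ol></b> bla </ol>";

        // Greedy: * grabs as much as it can, then backs off only as far as
        // needed, so the match ends at the LAST </ol>.
        string greedy = MatchOf(input, @"some random text(.|\r|\n)*</ol>");

        // Lazy: *? grows one character at a time, so the match ends at the
        // FIRST </ol>.
        string lazy = MatchOf(input, @"some random text(.|\r|\n)*?</ol>");

        Console.WriteLine("greedy: " + greedy);
        Console.WriteLine("lazy:   " + lazy);
    }
}
```

For this input the greedy match is the entire line, while the lazy match stops right after </a></ol>.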

Upvotes: 3
