Phoexo

Reputation: 2556

Regex matching too much

(\[(c|C)=)(#?([a-fA-F0-9]{1,2}){3})\](.*)\[/(c|C)\]

I want this expression to match text like: "This is [c=FFFFFF]white text[/c] and [C=#000]black text[/C]."

It does match a single BB code on its own, but when several follow one another (as in the example), it produces one match spanning both BB-code sequences (from [c=FFFFFF]wh... to ...ck text[/C]).

Why is this happening? Also, how do I make the dot (.) include newlines in C#?

Upvotes: 1

Views: 433

Answers (5)

skyfoot

Reputation: 20769

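You need a lazy regular expression so that it does not pick up all of the [c] tags in one match.

Try this:

\[c=(#?.*?)\](.*?)\[/c\]

or, if the color and the text contain only word characters (\w does not match spaces, so this one will not match "white text"):

\[c=(#?\w*?)\](\w*?)\[/c\]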

You should also set the options on your regex object to ignore case (RegexOptions.IgnoreCase).
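
As a rough illustration of the case-insensitive option, here is a minimal sketch that uses the first pattern above with RegexOptions.IgnoreCase. The class and method names (IgnoreCaseSketch, CountTags) are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IgnoreCaseSketch
{
    // The pattern is written in lowercase only; RegexOptions.IgnoreCase
    // lets it match [C=...] and [/C] as well.
    private static readonly Regex Tag =
        new Regex(@"\[c=(#?.*?)\](.*?)\[/c\]", RegexOptions.IgnoreCase);

    // Hypothetical helper: counts the color tags found in the input.
    public static int CountTags(string input)
    {
        return Tag.Matches(input).Count;
    }

    public static void Main()
    {
        string input = "This is [c=FFFFFF]white text[/c] and [C=#000]black text[/C].";
        Console.WriteLine(CountTags(input)); // prints 2
    }
}
```

Without the option, only the lowercase [c=FFFFFF]...[/c] tag in that sample would be found.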

Upvotes: 0

ybo

Reputation: 17152

If you don't care about nested tags, you can do this:

(\[[cC]=)(#?([a-fA-F0-9]{3}){1,2})\](.*?)\[/[cC]\]
//                                     ^- lazy match

If you want to handle nested tags with regex, check this article on code project.
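
For what it's worth, here is a small sketch of how the lazy pattern above could be used to pull out the color and the text of each tag. The names LazyTagSketch and Extract are invented for the example; group 2 is the color and group 4 is the enclosed text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LazyTagSketch
{
    // Same pattern as above: 3 or 6 hex digits, optional '#', lazy body.
    private static readonly Regex Tag = new Regex(
        @"(\[[cC]=)(#?([a-fA-F0-9]{3}){1,2})\](.*?)\[/[cC]\]");

    // Hypothetical helper: returns "color:text" for every tag found.
    public static List<string> Extract(string input)
    {
        var result = new List<string>();
        foreach (Match m in Tag.Matches(input))
        {
            result.Add(m.Groups[2].Value + ":" + m.Groups[4].Value);
        }
        return result;
    }

    public static void Main()
    {
        string input = "This is [c=FFFFFF]white text[/c] and [C=#000]black text[/C].";
        foreach (string item in Extract(input))
        {
            Console.WriteLine(item);
        }
        // prints:
        // FFFFFF:white text
        // #000:black text
    }
}
```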

Upvotes: 3

Adam Luter

Reputation: 2253

Regex is a quick and dirty way to do this, and the fix here is to use .*? rather than just .*. However, if you want a more robust solution, it is probably easier without regex. In C# the regex engine happens to be able to handle nested structures, but that doesn't mean it's actually easy. It would be better to use a lexical parser and construct a DOM. Most likely the code will be easier to read and maintain.

Upvotes: 0

acezanne

Reputation: 96

Dot matches newline characters if you set the option RegexOptions.Singleline (more on that here).
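
A quick sketch of the difference, assuming a lazy tag pattern like the ones suggested in the other answers. The class and method names (SinglelineSketch, IsTagMatch) are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineSketch
{
    // Lazy color-tag pattern; the dot in (.*?) is what Singleline affects.
    private const string Pattern =
        @"\[[cC]=(#?([a-fA-F0-9]{3}){1,2})\](.*?)\[/[cC]\]";

    // Hypothetical helper: true if the input contains a complete tag.
    public static bool IsTagMatch(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, Pattern, options);
    }

    public static void Main()
    {
        string input = "[c=FFF]line one\nline two[/c]";

        // By default the dot stops at '\n', so the tag body cannot span lines.
        Console.WriteLine(IsTagMatch(input, RegexOptions.None));       // False

        // With Singleline the dot also matches '\n'.
        Console.WriteLine(IsTagMatch(input, RegexOptions.Singleline)); // True
    }
}
```

Options can be combined with the | operator, for example RegexOptions.Singleline | RegexOptions.IgnoreCase.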

Upvotes: 2

unwind

Reputation: 399793

This happens because the RE is greedy; it will always try to produce the largest possible match.

It should be possible to make your RE engine non-greedy, see the linked document for tips on what to try.
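
To make the greedy behaviour concrete, here is a small sketch comparing the regex from the question with the same regex using a non-greedy .*? on the question's sample text. The names GreedyVsLazySketch and Count are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazySketch
{
    // The regex from the question: (.*) is greedy.
    public const string Greedy =
        @"(\[(c|C)=)(#?([a-fA-F0-9]{1,2}){3})\](.*)\[/(c|C)\]";

    // Same regex with a non-greedy body: (.*?).
    public const string Lazy =
        @"(\[(c|C)=)(#?([a-fA-F0-9]{1,2}){3})\](.*?)\[/(c|C)\]";

    // Hypothetical helper: number of matches of pattern in input.
    public static int Count(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string input = "This is [c=FFFFFF]white text[/c] and [C=#000]black text[/C].";

        // Greedy: .* runs to the end of the line, then backtracks to the LAST [/C].
        Console.WriteLine(Count(Greedy, input)); // prints 1

        // Lazy: .*? stops at the FIRST closing tag it can reach.
        Console.WriteLine(Count(Lazy, input));   // prints 2
    }
}
```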

Upvotes: 1
