Reputation: 11
It would be great if someone could provide a regular expression that matches the following strings.
Sample 1: <div>abc</div><br>
Sample 2: <div>abc</div></div></div></div></div><br>
As the samples above show, I need to match the string no matter how many </div> tags occur.
If any other text occurs between the </div> tags and <br>, for example <div>abc</div></div></div>DEF</div></div><br> or <div>abc</div></div></div></div></div>DEF<br>, then the regex should not match.
Thanks in advance.
Upvotes: 1
Views: 214
Reputation: 430
You could also include a named group in the expression, e.g.:
<div>(?<text>[^<]*)(?:<\/div>)*<br>
Implemented in C#:
using System;
using System.Text.RegularExpressions;

var regex = new Regex(@"<div>(?<text>[^<]*)(?:<\/div>)*<br>");
// Return the text captured by the named group, or null when the input does not match.
Func<Match, string> getGroupText = m => m.Success ? m.Groups["text"].Value : null;
Func<string, string> getText = s => getGroupText(regex.Match(s));
Console.WriteLine(getText("<div>abc</div><br>"));                          // abc
Console.WriteLine(getText("<div>123</div></div></div></div></div><br>"));  // 123
Upvotes: 1
Reputation: 2295
I think this regex is more flexible:
<div\b[^><]*+>(?>.*?</div>)(?:\s*+</div>)*+\s*+<br(?:\s*+/)?>
I did not include ^ and $ at the beginning and end of my regex because we cannot be sure that your sample will always be on a single line.
Upvotes: 0
Reputation: 85458
Try this:
<div>([^<]+)(?:<\/div>)*<br>
As seen on rubular
Notes:
This regex does not allow tags inside the abc part (or anything that has a < symbol).
Use ^<div>([^<]+)(?:<\/div>)*<br>$ if you want your string to match the pattern exactly.
If you want to allow the abc part to be empty, use * instead of +.
That being said, you should be wary of using regex to parse HTML.
In this example, you can use regex because you are parsing a (hopefully) known, regular subset of HTML. But a more robust solution (i.e., an [X]HTML parser like HtmlAgilityPack) is preferred when it comes to parsing HTML.
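As a quick check of the pattern against the four samples from the question, here is a minimal sketch in Python. The answer itself is language-neutral; the names PATTERN, extract and samples are illustrative and not part of the original answer. It uses fullmatch, which requires the whole string to fit the pattern, much like the ^...$ variant.

```python
import re

# Pattern from the answer; the forward slash needs no escaping in Python.
PATTERN = re.compile(r"<div>([^<]+)(?:</div>)*<br>")

def extract(s):
    """Return the captured text if the whole string fits the pattern, else None."""
    m = PATTERN.fullmatch(s)
    return m.group(1) if m else None

# The two samples that should match, then the two that should not.
samples = [
    "<div>abc</div><br>",
    "<div>abc</div></div></div></div></div><br>",
    "<div>abc</div></div></div>DEF</div></div><br>",
    "<div>abc</div></div></div></div></div>DEF<br>",
]

for s in samples:
    print(repr(extract(s)), s)
```

The last two samples are rejected because DEF sits where the pattern only permits further </div> tags or the final <br>.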
Upvotes: 3
Reputation: 399
NullUserException's answer is good. Here are a couple of questions, and variations, depending on what you want.
Do you want to prevent anything from occurring before the open div tag? If so, keep the ^ at the beginning of the regex. If not, drop it.
The rest of this post refers to the following section of the regex:
([^<]+?)
Do you want to capture the contents of the div, or just know that it matches your form? To capture, leave it as is. If you don't need to capture, drop the parentheses from the above.
Do you want to match if there is nothing inside the div? If so, change the + in the above to *.
Finally, although it will work fine, you don't need the ? in the above: [^<] already stops at the first <, so lazy and greedy matching capture the same text.
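The variations discussed in this answer can be compared side by side. This is a rough sketch in Python rather than anything from the original answers; the variable names (TAIL, greedy, lazy, allow_empty, no_capture) are made up for illustration.

```python
import re

# Shared tail of the pattern: any number of </div> tags, then <br>.
TAIL = r"(?:</div>)*<br>"

greedy = re.compile(r"<div>([^<]+)" + TAIL)       # capture, at least one character
lazy = re.compile(r"<div>([^<]+?)" + TAIL)        # same, with the optional ?
allow_empty = re.compile(r"<div>([^<]*)" + TAIL)  # * also accepts an empty div
no_capture = re.compile(r"<div>[^<]+" + TAIL)     # parentheses dropped

text = "<div>abc</div></div><br>"

# Lazy and greedy capture the same text, since [^<] cannot run past the first <.
same_capture = greedy.search(text).group(1) == lazy.search(text).group(1)

# An empty div is rejected with + and accepted with *.
empty_with_plus = greedy.search("<div></div><br>")
empty_with_star = allow_empty.search("<div></div><br>")

# Without parentheses the pattern still matches but exposes no groups.
groups_without_capture = no_capture.search(text).groups()

print(same_capture, empty_with_plus, empty_with_star.group(1) == "", groups_without_capture)
```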
Upvotes: 0
Reputation: 2333
You need to use a real parser. Things like infinitely nested tags can't be handled via regex.
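One possible reading of this advice, sketched with the tokenizer in Python's standard library (html.parser) rather than any library named in this thread: collect the tags and text as a flat list of events, then check that the list has the expected shape. TokenCollector and fits_shape are hypothetical names, and the check assumes the same simple structure as the question's samples.

```python
from html.parser import HTMLParser

class TokenCollector(HTMLParser):
    """Records start tags, end tags, and text as a flat list of events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Merge adjacent text chunks so one run of text is one event.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))

def fits_shape(html):
    """True for: <div>, some text, zero or more </div>, then <br>, and nothing else."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    ev = collector.events
    if len(ev) < 3:
        return False
    if ev[0] != ("start", "div") or ev[1][0] != "data" or ev[-1] != ("start", "br"):
        return False
    # Everything between the text and the final <br> must be a closing div.
    return all(e == ("end", "div") for e in ev[2:-1])

print(fits_shape("<div>abc</div></div><br>"))
print(fits_shape("<div>abc</div>DEF</div><br>"))
```

Note that this tokenizer only reports tags in order and does not build a tree, so it is a step toward a parser-based check, not a full replacement for a library such as HtmlAgilityPack.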
Upvotes: 1