Regex multiple occurrences of text between tags

I need to get a string consisting only of the text between a pair of defined tags, and also a string consisting of the text including the tags. Since the text resides inside HTML <p> tags, the < and > characters are encoded as &lt; and &gt; (which, as far as I know, makes it impossible to use a parser like the HTML Agility Pack).

So the input string looks like this:

Text outside of tags
&lt;internal&gt;    First occurrence of text inside of tags    &lt;/internal&gt;
More text outside of tags
&lt;internal&gt;    Second occurrence     &lt;/internal&gt;

I'm using the following code right now, but it only gets the first occurrence and not the second one:

Regex regex = new Regex(@"(&lt;internal&gt;(.*?)&lt;/internal&gt;)", RegexOptions.Singleline);
MatchCollection matches = regex.Matches(inputString);

foreach (Match match in matches)
{
    string outerMatch = match.Groups[1].Value;
    string innerMatch = match.Groups[2].Value;
}

Upvotes: 1

Answers (3)

Oh, the code actually works. The reason it didn't pick up the second occurrence was that the editor that creates the documents inserted extra markup tags in the text inside of the tags in some cases, which made the regex fail to match it. I changed the regex to this:

Regex regex = new Regex(@"(&lt;.*?internal.*?&gt;(.*?)&lt;.*?/.*?internal.*?&gt;)", RegexOptions.Singleline);
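To illustrate why the loosened pattern helps, here is a minimal sketch. The pattern is copied from the line above; the class name InternalTagDemo, the method name InnerTexts, and the idea that the stray markup is a <span> wrapped around the word "internal" are assumptions made for the example, since the thread does not say which tags the editor inserted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InternalTagDemo
{
    // Pattern taken from the answer: the lazy ".*?" parts tolerate
    // extra markup between "&lt;" and "internal", and before "&gt;".
    private static readonly Regex Loosened = new Regex(
        @"(&lt;.*?internal.*?&gt;(.*?)&lt;.*?/.*?internal.*?&gt;)",
        RegexOptions.Singleline);

    // Returns the trimmed inner text (group 2) of every match.
    public static List<string> InnerTexts(string input)
    {
        var result = new List<string>();
        foreach (Match match in Loosened.Matches(input))
        {
            result.Add(match.Groups[2].Value.Trim());
        }
        return result;
    }
}
```

Called on an input whose second marker is written as &lt;<span>internal</span>&gt; (hypothetical markup), InnerTexts returns both inner texts, whereas the original stricter pattern would skip the second one. The trade-off is that the pattern is permissive: any &lt; that is eventually followed by the word "internal" can start a match, so it may overmatch on documents that contain other encoded tags.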

Thanks anyway!

Upvotes: 1

Ria

Reputation: 10347

Use \< and \> instead of &lt; and &gt;, like this:

(\<internal\>(.*?)\</internal\>)
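Note that this pattern only matches literal < and > characters, which the input in the question does not contain. One way to make it applicable, which the answer does not state, is to decode the entities before matching. A minimal sketch, assuming the target framework provides System.Net.WebUtility; the names DecodedTagDemo and Extract are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class DecodedTagDemo
{
    // Pattern from the answer. In .NET the backslashes before < and >
    // are optional, because neither character is a regex metacharacter.
    private static readonly Regex Literal = new Regex(
        @"(\<internal\>(.*?)\</internal\>)",
        RegexOptions.Singleline);

    // Decodes &lt; / &gt; first, then collects the outer text
    // (with tags) and the inner text (without tags) of each match.
    public static List<(string Outer, string Inner)> Extract(string encoded)
    {
        string decoded = WebUtility.HtmlDecode(encoded);
        var result = new List<(string Outer, string Inner)>();
        foreach (Match match in Literal.Matches(decoded))
        {
            result.Add((match.Groups[1].Value, match.Groups[2].Value));
        }
        return result;
    }
}
```

Decoding changes the whole string, not just the tag markers, so any other entities in the surrounding text are decoded as well. If the original encoded form must be preserved, matching on &lt; and &gt; directly, as in the question, avoids that side effect.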

Upvotes: -1

Tigran

Reputation: 62246

Yet another question like this.

Do not use regular expressions to identify tags. Regular expressions are stateless and cannot operate correctly on HTML or XML. You need to use a parser for this.
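A small sketch of the kind of input where the regex approach breaks down: nested tags. The pattern is the one from the question; the nested input and the names NestedTagDemo and InnerTexts are hypothetical, chosen only to show the failure mode.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    // Same pattern as in the question.
    private static readonly Regex Pattern = new Regex(
        @"(&lt;internal&gt;(.*?)&lt;/internal&gt;)",
        RegexOptions.Singleline);

    // Returns group 2 (the text between the tags) for every match.
    public static List<string> InnerTexts(string input)
    {
        var result = new List<string>();
        foreach (Match match in Pattern.Matches(input))
        {
            result.Add(match.Groups[2].Value);
        }
        return result;
    }
}
```

For the nested input &lt;internal&gt;outer &lt;internal&gt;inner&lt;/internal&gt; tail&lt;/internal&gt;, the lazy quantifier pairs the first opening tag with the first closing tag, so the only match has the inner text "outer &lt;internal&gt;inner" and the trailing " tail" is lost. If the documents never nest these tags, as the question's sample suggests, this limitation does not come into play.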

Use the HTML Agility Pack for HTML parsing.

Upvotes: 1
