Reputation: 763
I need to get a string consisting only of the text between a pair of defined tags, and also a string consisting of the text including the tags. Since the text resides inside HTML <p> tags, the < and > are interpreted like &lt; and &gt; (which, as far as I know, makes it impossible to use a parser like the HTML Agility Pack).
So the input string looks like this:
Text outside of tags
<internal> First occurrence of text inside of tags </internal>
More text outside of tags
<internal> Second occurrence </internal>
I'm using the following code right now, but it only gets the first occurrence and not the second one:
Regex regex = new Regex(@"(<internal>(.*?)</internal>)", RegexOptions.Singleline);
MatchCollection matches = regex.Matches(inputString);
foreach (Match match in matches)
{
    string outerMatch = match.Groups[1].Value; // text including the tags
    string innerMatch = match.Groups[2].Value; // text between the tags
}
Upvotes: 1
Views: 1891
Reputation: 763
Oh, the code actually works. The reason it didn't pick up the second occurrence was that the editor creating the documents had, in some cases, inserted extra tags into the text of the internal tags, which made the regex fail to match. I changed the regex to this:
Regex regex = new Regex(@"(<.*?internal.*?>(.*?)<.*?/.*?internal.*?>)", RegexOptions.Singleline);
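For anyone who wants to try this, here is a minimal sketch that wraps the relaxed pattern in a helper. The class name InternalTagExtractor and the method name Extract are made up for illustration; only the pattern and the two capture groups come from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class InternalTagExtractor
{
    // Relaxed pattern from the answer: ".*?" allows extra characters
    // (for example editor-inserted markup) around the word "internal".
    private static readonly Regex TagRegex = new Regex(
        @"(<.*?internal.*?>(.*?)<.*?/.*?internal.*?>)",
        RegexOptions.Singleline);

    // Returns one (Outer, Inner) pair per match:
    // Outer is the text including the tags, Inner is the text between them.
    public static List<(string Outer, string Inner)> Extract(string input)
    {
        var results = new List<(string Outer, string Inner)>();
        foreach (Match match in TagRegex.Matches(input))
        {
            results.Add((match.Groups[1].Value, match.Groups[2].Value));
        }
        return results;
    }
}
```

Calling InternalTagExtractor.Extract on the sample input from the question should return two pairs, one per internal block. One caveat with the relaxed pattern: because the opening part starts at any < and RegexOptions.Singleline lets . cross line breaks, an unrelated < earlier in the text (a <p> tag, say) can be swallowed into the outer match.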
Thanks anyway!
Upvotes: 1
Reputation: 10347
Use \< and \> instead of < and >, like this:
(\<internal\>(.*?)\</internal\>)
Upvotes: -1
Reputation: 62246
This kind of question comes up again and again.
Do not use regular expressions to identify tags. Regular expressions are stateless and cannot operate correctly on HTML or XML. You need to use a parser for this.
Use the HTML Agility Pack for HTML parsing.
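A rough sketch of the parser route, under some assumptions: the tags arrive entity-encoded (&lt;internal&gt;), the decoded text is well-formed, and the internal blocks are not broken up by other markup. To keep it free of external packages, this uses System.Xml.Linq from the standard library instead of the HTML Agility Pack, so it is a stand-in for the idea rather than Agility Pack code. The class name InternalTagParser and the <root> wrapper are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Xml.Linq;

// Hypothetical helper, named for illustration only.
public static class InternalTagParser
{
    public static List<(string Outer, string Inner)> Extract(string encodedText)
    {
        // Turn &lt;internal&gt; back into <internal>.
        string decoded = WebUtility.HtmlDecode(encodedText);

        // Wrap the fragment in a made-up <root> element so it has a single root.
        // XElement.Parse throws XmlException if the decoded text is not well-formed.
        XElement root = XElement.Parse(
            "<root>" + decoded + "</root>",
            LoadOptions.PreserveWhitespace);

        // Outer: the element serialized with its tags. Inner: its text content.
        return root.Descendants("internal")
                   .Select(e => (Outer: e.ToString(SaveOptions.DisableFormatting),
                                 Inner: e.Value))
                   .ToList();
    }
}
```

An XML parser is strict, so unclosed HTML tags or a stray & in the decoded text will make this throw. A lenient HTML parser such as the Agility Pack is the better fit when the input cannot be guaranteed to be well-formed.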
Upvotes: 1