Reputation: 748
I have an HTML file (I can't use HTML Agility Pack) from which I want to extract the id of each div (if it has one):
<div id="div1">Street ___________________ </div>
<div id="div2">CAP |__|__|__|__|__| number ______ </div>
<div id="div3">City _____________________ State |__|__|</div>
<div id="div4">City2 ____________________ State2 _____</div>
I have a pattern for extracting the underscores: [\ _]{3,}
If there is a div in front of the underscores, I want to extract it as well; if not, I want only the underscores.
So far I have built this pattern: (<div id(.+?)>(\w)([\ _]{3,}/*))([\ _]{3,})
The first part is built out of 3 groups:
1 - a div tag: <div id(.+?)>
2 - a label: (\w)
3 - underscores: [\ _]{3,}/*
The id of the div with the id div2 should not be extracted, because its content contains non-alphanumeric characters.
Q: What is wrong with my pattern?
Desired matches for the 4 divs:
<div id="div1">Street ___________________
______
<div id="div3">City _____________________
<div id="div4">City2 ____________________
_____
Upvotes: 0
Views: 82
Reputation: 55589
\w is just a single character, you probably want to say one or more - \w+.
/* - zero or more /'s? I don't see where that fits in.
One or more not >'s (i.e. [^>]+) is probably a better idea than .+?. .+? will try to stop at the first >, but will continue until it finds a string that matches, i.e.:
<div id=1>this is not valid</div><div id=2>this is valid___</div>
will match the whole string, instead of just from <div id=2>.
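The difference between .+? and [^>]+ can be checked with a minimal sketch like the one below. The class name, the helper FirstMatch, and the input string are made up for illustration; the input is a shortened variant of the example above in which the label directly follows the >, so that \w+ followed by underscores can actually match.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyVsNegatedDemo
{
    // Hypothetical input: the first div has no underscores, the second one does.
    public const string Input = "<div id=1>not valid</div><div id=2>valid___</div>";

    // Returns the text of the first match of the given pattern in Input.
    public static string FirstMatch(string pattern)
    {
        return Regex.Match(Input, pattern).Value;
    }

    static void Main()
    {
        // Lazy dot: may cross '>' when that is the only way to reach a match.
        Console.WriteLine(FirstMatch(@"<div id(.+?)>(\w+)([ _]{3,})"));

        // Negated class: can never leave the opening tag.
        Console.WriteLine(FirstMatch(@"<div id([^>]+)>(\w+)([ _]{3,})"));
    }
}
```

With this input, the first pattern matches from <div id=1> all the way to the underscores in the second div, while the second pattern matches only <div id=2>valid___.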
As far as I can tell from your question, everything before the underscores should be optional.
Pattern:
(?:(<div id[^>]+>)(\w+))?([\ _]{3,})
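As a rough check, the pattern can be run line by line against the four divs from the question. This is only a sketch: the class name and the Extract helper are hypothetical, and the Trim() call is an added assumption to drop the spaces that [\ _]{3,} picks up around the underscores.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UnderscoreExtractDemo
{
    // The pattern from this answer: optional "<div id...>label" prefix, then underscores.
    static readonly Regex Rx = new Regex(@"(?:(<div id[^>]+>)(\w+))?([\ _]{3,})");

    // Returns every match found in one line, with surrounding spaces trimmed.
    public static List<string> Extract(string line)
    {
        var results = new List<string>();
        foreach (Match m in Rx.Matches(line))
        {
            results.Add(m.Value.Trim());
        }
        return results;
    }

    static void Main()
    {
        // Sample lines copied from the question.
        string[] lines =
        {
            @"<div id=""div1"">Street ___________________ </div>",
            @"<div id=""div2"">CAP |__|__|__|__|__| number ______ </div>",
            @"<div id=""div3"">City _____________________ State |__|__|</div>",
            @"<div id=""div4"">City2 ____________________ State2 _____</div>"
        };

        foreach (string line in lines)
        {
            foreach (string hit in Extract(line))
            {
                Console.WriteLine(hit);
            }
        }
    }
}
```

For div2 the optional prefix is skipped, because CAP is followed by a single space and a |, so only the run of underscores after "number" is returned. The |__| boxes are ignored because each contains fewer than three underscores.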
Upvotes: 1
Reputation: 111840
Try something like
string html = @"<div id=""div1"">Street ___________________ </div>
<div id=""div2"">CAP |__|__|__|__|__| number ______ </div>
<div id=""div3"">City _____________________ State |__|__|</div>
<div name=""hello"" id=""div4"">City _____________________ State |__|__|</div>
<div name=""house"">City _____________________ State |__|__|</div>
<div id=""notext""></div>";
var rx = new Regex(@"<div(?:(?: id=""(?<id>[^""]+)"")|[^>])*>(?<content>[^<]*)</div>",
RegexOptions.IgnoreCase);
var matches = rx.Matches(html);
foreach (Match match in matches)
{
var id = match.Groups["id"];
var content = match.Groups["content"];
Console.WriteLine("id present: {0}, id: {1}, text: {2}",
id.Success,
id.ToString(),
content.ToString());
}
If it works, I'll explain the regex (that is, <div(?:(?: id="(?<id>[^"]+)")|[^>])*>(?<content>[^<]*)</div>).
Upvotes: 1