Misi

Reputation: 748

Parsing multiple groups

I have an HTML file (I can't use HTML Agility Pack) and I want to extract the id of each div, if it has one.

<div id="div1">Street ___________________ </div>
<div id="div2">CAP |__|__|__|__|__| number ______ </div>
<div id="div3">City _____________________ State |__|__|</div>
<div id="div4">City2 ____________________ State2 _____</div>

I have a pattern for extracting runs of underscores: [\ _]{3,}

If there is a div tag directly in front of the underscores, I want to extract it too; if not, I want only the underscores.

So far I have built this pattern: (<div id(.+?)>(\w)([\ _]{3,}/*))([\ _]{3,})

The first part is built out of 3 groups: 1 - a div tag, 2 - a label, 3 - underscores

1 - <div id(.+?)>, 2 - (\w) , 3 - [\ _]{3,}/*

The div with the id div2 should not yield its id, because its text contains non-alphanumeric characters.

Q: What is wrong with my pattern?

Desired matches for the 4 divs:

<div id="div1">Street ___________________
______ 
<div id="div3">City _____________________
<div id="div4">City2 ____________________
_____

Upvotes: 0

Views: 82

Answers (2)

Bernhard Barker

Reputation: 55589

  • \w is just a single character, you probably want to say one or more - \w+.

  • /* - zero or more /'s? I don't see where that fits in.

  • One or more non-> characters (i.e. [^>]+) is probably a better idea than .+?. Being lazy, .+? first tries to stop at the first >, but if the rest of the pattern fails there, it keeps extending past later >'s until the whole pattern matches, i.e.:

    <div id=1>this is not valid</div><div id=2>this is valid___</div>
    

    will match the whole string, instead of just from <div id=2>.

  • As far as I can tell from your question, everything before the underscores should be optional.
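The difference between .+? and [^>]+ can be checked with a small sketch. The sample input and the simplified patterns below are made up for illustration (one-word labels, underscores directly after the label), and the class name LazyVsNegated is hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsNegated
{
    // Hypothetical sample: the first div has no blanks, the second one does.
    public const string Sample = "<div id=1>no blanks</div><div id=2>City___</div>";

    // Lazy version: .+? may grow past '>' when the rest of the pattern fails.
    public static string MatchLazy(string input) =>
        Regex.Match(input, @"<div id.+?>\w+_{3,}").Value;

    // Negated class: [^>]+ can never cross the end of the opening tag.
    public static string MatchNegated(string input) =>
        Regex.Match(input, @"<div id[^>]+>\w+_{3,}").Value;

    public static void Main()
    {
        Console.WriteLine(MatchLazy(Sample));    // starts at <div id=1>
        Console.WriteLine(MatchNegated(Sample)); // starts at <div id=2>
    }
}
```

With the lazy pattern the match begins at the first div and swallows its closing tag; with the negated class the engine gives up on the first div and matches only the second.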

Pattern:

(?:(<div id[^>]+>)(\w+))?([\ _]{3,})

C# Test.
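A minimal sketch of how the pattern above could be run against the four divs from the question. The class and method names (BlankFinder, Find) are hypothetical, and trimming the blanks is only for display:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BlankFinder
{
    // Pattern from the answer: tag and label are optional, the blanks are required.
    private static readonly Regex Rx =
        new Regex(@"(?:(<div id[^>]+>)(\w+))?([\ _]{3,})");

    // One (tag, label, blanks) tuple per match. Tag and label are empty strings
    // when the blanks are not directly preceded by "<div id...>label".
    public static List<(string Tag, string Label, string Blanks)> Find(string html)
    {
        var results = new List<(string Tag, string Label, string Blanks)>();
        foreach (Match m in Rx.Matches(html))
        {
            results.Add((m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value.Trim()));
        }
        return results;
    }

    public static void Main()
    {
        string html = @"<div id=""div1"">Street ___________________ </div>
<div id=""div2"">CAP |__|__|__|__|__| number ______ </div>
<div id=""div3"">City _____________________ State |__|__|</div>
<div id=""div4"">City2 ____________________ State2 _____</div>";

        foreach (var r in Find(html))
        {
            Console.WriteLine("tag: '{0}', label: '{1}', blanks: '{2}'", r.Tag, r.Label, r.Blanks);
        }
    }
}
```

On this input there should be five matches, in the same order as the desired list in the question: three with a tag and label (div1, div3, div4) and two with blanks only.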

Upvotes: 1

xanatos

Reputation: 111840

Try something like:

string html = @"<div id=""div1"">Street ___________________ </div>
<div id=""div2"">CAP |__|__|__|__|__| number ______ </div>
<div id=""div3"">City _____________________ State |__|__|</div>
<div name=""hello"" id=""div4"">City _____________________ State |__|__|</div>
<div name=""house"">City _____________________ State |__|__|</div>
<div id=""notext""></div>";

var rx = new Regex(@"<div(?:(?: id=""(?<id>[^""]+)"")|[^>])*>(?<content>[^<]*)</div>", 
                   RegexOptions.IgnoreCase);

var matches = rx.Matches(html);

foreach (Match match in matches)
{
    var id = match.Groups["id"];
    var content = match.Groups["content"];

    Console.WriteLine("id present: {0}, id: {1}, text: {2}", 
                      id.Success, 
                      id.ToString(), 
                      content.ToString());
}

If it works, I'll explain the regex (that is, <div(?:(?: id="(?<id>[^"]+)")|[^>])*>(?<content>[^<]*)</div>)
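One way to read that regex is to spell it out with RegexOptions.IgnorePatternWhitespace and inline comments. This is a sketch of my reading, not the answerer's own explanation; the class and method names (DivIdReader, Read) are hypothetical, and the literal space before id is written as [ ] because free-spacing mode ignores bare spaces:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DivIdReader
{
    private static readonly Regex Rx = new Regex(@"
        <div                              # start of the opening tag
        (?:                               # repeated for everything inside the tag:
            (?:[ ]id=""(?<id>[^""]+)"")   #   an id attribute, value captured as id
            | [^>]                        #   or one character that does not close the tag
        )*
        >                                 # end of the opening tag
        (?<content>[^<]*)                 # text up to the next tag, captured as content
        </div>                            # the closing tag
        ", RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns (HasId, Id, Content) for every div the pattern matches.
    public static List<(bool HasId, string Id, string Content)> Read(string html)
    {
        var results = new List<(bool HasId, string Id, string Content)>();
        foreach (Match m in Rx.Matches(html))
        {
            var id = m.Groups["id"];
            results.Add((id.Success, id.Value, m.Groups["content"].Value));
        }
        return results;
    }

    public static void Main()
    {
        // Made-up input: one div with an id after another attribute, one without an id.
        foreach (var r in Read(@"<div name=""a"" id=""x"">t</div><div name=""b"">u</div>"))
        {
            Console.WriteLine("id present: {0}, id: {1}, text: {2}", r.HasId, r.Id, r.Content);
        }
    }
}
```

Because the id alternative is tried before [^>] at every position, an id attribute is captured wherever it appears in the tag, and a div without one still matches with the id group unset.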

Upvotes: 1
