Reputation: 31
I need to retrieve several div sections (with the specific class name "row ") together with their content, and additionally find all anchor tags (link URLs) with the class "underline red bold". In short, I want to get each section of the form:
<div class = "row ">
... (divs, tags ...)
<a class="underline red bold" href="/searchClickThru?pid=prod56534895&q=&rpos=109181&rpp=10&_dyncharset=UTF-8&sort=&url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
and a collection of the URLs:
string[] urls = {"/searchClickThru?pid=prod56534895&q=&rpos=109181&rpp=10&_dyncharset=UTF-8&sort=&url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p"};
The entire page looks like this:
<html>
... a lot of stuff
<div class="row ">
<div class="photo">
<a rel="nofollow" href="/searchClickThru?pid=prod56534895&q=&rpos=109181&rpp=10&_dyncharset=UTF-8&sort=&url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
<img alt="alt msg" src="/b/s/b9/03/b9038292d147a582add07ee1f0607827.jpg">
</a>
</div>
<div class="desc">
<div class="l1">
<div class="icons">
</div>
<table cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td>
<div class="fleft">
<a class="underline red bold" href="/searchClickThru?pid=prod56534895&q=&rpos=109181&rpp=10&_dyncharset=UTF-8&sort=&url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
Culture And Gender <br>Intimate Relation</a>
</div>
<div class="fleft">
</div>
</td>
</tr>
</tbody>
</table>
</div>
<div class="l2">
<div>
</div>
<div>
<div class="but">
</div>
</div>
</div>
<div class="l3">
Long description
<a class="underlinepix_red no_wrap" rel="nofollow" href="/searchClickThru?pid=prod56534895&q=&rpos=109181&rpp=10&_dyncharset=UTF-8&sort=&url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
more<img alt="" src="/b/img/arr_red_sm.gif">
</a>
</div>
</div>
</div>
<div class="omit"></div>
<div class="row ">
<div class="photo">
<a rel="nofollow" href="/searchClickThru?pid=prod56534895&q=&rpos=109181&rpp=10&_dyncharset=UTF-8&sort=&url=/culture-and-gender-intimate-relation-ksiazka,prod56534899,p">
<img alt="alt msg" src="/b/s/b9/03/b9038292d147a582add07ee1f06078222.jpg">
</a>
</div>
<div class="desc">
<div class="l1">
<div class="icons">
</div>
<table cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td>
<div class="fleft">
<a class="underline red bold" href="/searchClickThru?pid=prod56534895&q=&rpos=109181&rpp=10&_dyncharset=UTF-8&sort=&url=/culture-and-gender-intimate-relation-ksiazka,prod5653489225,p">
Culture And Gender <br>Intimate Relation</a>
</div>
<div class="fleft">
</div>
</td>
</tr>
</tbody>
</table>
</div>
<div class="l2">
<div>
</div>
<div>
<div class="but">
</div>
</div>
</div>
<div class="l3">
Long description
<a class="underlinepix_red no_wrap" rel="nofollow" href="/searchClickThru?pid=prod56534895&q=&rpos=109181&rpp=10&_dyncharset=UTF-8&sort=&url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
more<img alt="" src="/b/img/arr_red_sm.gif">
</a>
</div>
</div>
</div>
Can anybody help me create a suitable regex?
Upvotes: 2
Views: 5058
Reputation: 8301
Alternatively, if you've managed to get into LINQ and like the power of LINQ, there appears to be a LINQ-to-HTML Library available for download. I haven't tried it yet, so I cannot speak to its ability.
Upvotes: 1
Reputation: 40395
Is it NECESSARY to use regular expressions? If not, then why don't you use an HTML parser like Html Agility Pack... it will be MUCH easier to get what you want if you use a parser instead of regular expressions.
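To give a rough idea of what the parser route looks like, here is a minimal sketch in Python using only the standard library's html.parser. It is not Html Agility Pack's API, just the same idea in a self-contained form. The names LinkCollector, extract_urls and the shortened SAMPLE markup are made up for this illustration; the class names come from the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry all of the wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated token list; compare as sets
        # so that token order and extra whitespace do not matter.
        classes = set((attr_map.get("class") or "").split())
        href = attr_map.get("href")
        if href and self.wanted <= classes:
            self.urls.append(href)


def extract_urls(html, wanted_classes=("underline", "red", "bold")):
    parser = LinkCollector(wanted_classes)
    parser.feed(html)
    parser.close()
    return parser.urls


# Hypothetical, shortened sample in the shape of the page from the question.
SAMPLE = """
<div class="row ">
  <div class="photo"><a rel="nofollow" href="/photo/1"><img alt="x" src="a.jpg"></a></div>
  <div class="fleft"><a class="underline red bold" href="/item/1">Title <br>One</a></div>
  <a class="underlinepix_red no_wrap" href="/more/1">more</a>
</div>
"""

print(extract_urls(SAMPLE))  # ['/item/1']
```

Only the anchor whose class list contains all three tokens is returned; the photo link and the "more" link are skipped without any pattern matching on the raw text.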
Upvotes: 0
Reputation: 27581
Check out the HTML Agility Pack:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Upvotes: 0
Reputation: 24443
The answer to this question is roughly the same as the answer to this question:
RegEx match open tags except XHTML self-contained tags
Upvotes: 1
Reputation: 25593
Regular expressions are not well suited for this.
Due to the nested nature of HTML, a regular expression that does what you ask would be very long and complicated, because it would have to work out which closing </div> belongs to which opening <div>. Use an HTML parser instead.
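The nesting problem is easier to see in code. The sketch below is an illustration in Python with the standard library's html.parser, not any particular .NET library. It captures each <div class="row "> section by counting how many divs are currently open, which is exactly the bookkeeping a regular expression cannot do cleanly. RowSectionCollector, extract_rows and the SAMPLE markup are hypothetical names and data for this example.

```python
from html.parser import HTMLParser


class RowSectionCollector(HTMLParser):
    """Capture the markup of every <div> whose class list contains 'row'."""

    def __init__(self):
        # Keep entities as written so captured sections stay close to the source.
        super().__init__(convert_charrefs=False)
        self.sections = []   # finished sections, as strings
        self._parts = None   # pieces of the section being captured, or None
        self._depth = 0      # number of <div> tags open inside that section

    def handle_starttag(self, tag, attrs):
        if self._parts is None:
            classes = (dict(attrs).get("class") or "").split()
            if tag != "div" or "row" not in classes:
                return
            self._parts = []  # a new section starts at this tag
            self._depth = 0
        if tag == "div":
            self._depth += 1
        self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> is copied through without
        # touching the depth counter.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        self._parts.append(f"</{tag}>")
        if tag == "div":
            self._depth -= 1
            if self._depth == 0:  # the </div> matching the opening row div
                self.sections.append("".join(self._parts))
                self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._parts is not None:
            self._parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._parts is not None:
            self._parts.append(f"&#{name};")


def extract_rows(html):
    parser = RowSectionCollector()
    parser.feed(html)
    parser.close()
    return parser.sections


# Hypothetical, shortened sample in the shape of the page from the question.
SAMPLE = """<html><body>
<div class="row "><div class="desc"><div class="l1">A</div></div></div>
<div class="omit"></div>
<div class="row "><div class="photo"><img alt="" src="b.gif"></div>B &amp; C</div>
</body></html>
"""

rows = extract_rows(SAMPLE)
print(len(rows))  # 2
```

A non-greedy pattern such as `<div class="row ">.*?</div>` would stop at the first inner </div>, while a greedy one would run on to the last </div> on the page; the depth counter avoids both failure modes.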
Upvotes: 15