czesio

Reputation: 31

C# RegEx - find html tags (div and anchor)

I have to retrieve several div sections (with the specific class name "row ") together with their content, and additionally find all anchor tags (link URLs) with the class "underline red bold". In short, I want to get sections like:

<div class = "row ">
 ... (divs, tags ...)
<a class="underline red bold" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">

and a collection of URLs:

string[] urls = {"/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p"};

The entire page looks like this:

<html>

... a lot of stuff

<div class="row ">

  <div class="photo">
    <a rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
      <img alt="alt msg" src="/b/s/b9/03/b9038292d147a582add07ee1f0607827.jpg">                 
 </a>
  </div>

  <div class="desc">
    <div class="l1">
      <div class="icons">
      </div>

      <table cellspacing="0" cellpadding="0" border="0">
        <tbody>
          <tr>
            <td>
              <div class="fleft">
                <a class="underline red bold" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
                  Culture And Gender   <br>Intimate Relation</a>
              </div>

              <div class="fleft">

              </div>
            </td>
          </tr>
        </tbody>
      </table>
    </div>
    <div class="l2">

      <div>
      </div>
      <div>
        <div class="but">
        </div>
      </div>
    </div>
    <div class="l3">
      Long description
      <a class="underlinepix_red no_wrap" rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
        more<img alt="" src="/b/img/arr_red_sm.gif">
  </a>
    </div>
  </div>
</div>

<div class="omit"></div>

<div class="row ">

  <div class="photo">
    <a rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534899,p">
      <img alt="alt msg" src="/b/s/b9/03/b9038292d147a582add07ee1f06078222.jpg">                    
 </a>
  </div>

  <div class="desc">
    <div class="l1">
      <div class="icons">
      </div>

      <table cellspacing="0" cellpadding="0" border="0">
        <tbody>
          <tr>
            <td>
              <div class="fleft">
                <a class="underline red bold" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod5653489225,p">
                  Culture And Gender   <br>Intimate Relation</a>
              </div>

              <div class="fleft">

              </div>
            </td>
          </tr>
        </tbody>
      </table>
    </div>
    <div class="l2">

      <div>
      </div>
      <div>
        <div class="but">
        </div>
      </div>
    </div>
    <div class="l3">
      Long description
      <a class="underlinepix_red no_wrap" rel="nofollow" href="/searchClickThru?pid=prod56534895&amp;q=&amp;rpos=109181&amp;rpp=10&amp;_dyncharset=UTF-8&amp;sort=&amp;url=/culture-and-gender-intimate-relation-ksiazka,prod56534895,p">
        more<img alt="" src="/b/img/arr_red_sm.gif">
  </a>
    </div>
  </div>
</div>

Can anybody help me create a suitable regex?

Upvotes: 2

Views: 5058

Answers (5)

Pretzel

Reputation: 8301

Alternatively, if you've managed to get into LINQ and like the power of LINQ, there appears to be a LINQ-to-HTML Library available for download. I haven't tried it yet, so I cannot speak to its ability.

Upvotes: 1

Kiril

Reputation: 40395

Is it NECESSARY to use regular expressions? If not, then why don't you use an HTML parser like Html Agility Pack... it will be MUCH easier to get what you want if you use a parser instead of regular expressions.

Upvotes: 0

Matthew Vines

Reputation: 27581

Check out the HTML Agility Pack

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
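
For the page in the question, a minimal sketch with the Html Agility Pack might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced; the class name RowScraper, its two helper methods, and the shortened sample markup are hypothetical and only for illustration. The XPath uses normalize-space() so that the trailing space in class="row " does not matter.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class RowScraper // hypothetical helper, not part of the library
{
    // Returns the outer HTML of every <div> whose class is "row" (ignoring surrounding spaces).
    public static List<string> GetRowSections(HtmlDocument doc)
    {
        var sections = new List<string>();
        var rows = doc.DocumentNode.SelectNodes("//div[normalize-space(@class)='row']");
        if (rows == null) return sections; // SelectNodes returns null when nothing matches
        foreach (HtmlNode row in rows)
            sections.Add(row.OuterHtml);
        return sections;
    }

    // Returns the href of every <a class="underline red bold"> found inside a row div.
    public static List<string> GetTitleLinks(HtmlDocument doc)
    {
        var urls = new List<string>();
        var anchors = doc.DocumentNode.SelectNodes(
            "//div[normalize-space(@class)='row']//a[@class='underline red bold']");
        if (anchors == null) return urls;
        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", "");
            // Decode entities such as &amp; so the URL can be used directly.
            urls.Add(HtmlEntity.DeEntitize(href));
        }
        return urls;
    }

    public static void Main()
    {
        // Shortened stand-in for the page from the question.
        string html =
            "<html><body>" +
            "<div class=\"row \"><div class=\"photo\"><a rel=\"nofollow\" href=\"/photo1\">img</a></div>" +
            "<div class=\"desc\"><a class=\"underline red bold\" href=\"/item,prod56534895,p\">Title 1</a></div></div>" +
            "<div class=\"omit\"></div>" +
            "<div class=\"row \"><div class=\"desc\">" +
            "<a class=\"underline red bold\" href=\"/item,prod56534899,p\">Title 2</a></div></div>" +
            "</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        Console.WriteLine("Row sections: " + GetRowSections(doc).Count);
        foreach (string url in GetTitleLinks(doc))
            Console.WriteLine(url);
    }
}
```

Because the parser builds a tree first, the nested divs inside each row need no special handling: each row node already contains all of its descendants.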

Upvotes: 0

GWLlosa

Reputation: 24443

The answer to this question is roughly the same as the answer to this question:

RegEx match open tags except XHTML self-contained tags

Upvotes: 1

Jens

Reputation: 25593

Regular expressions are not well suited for this.

Due to the nested nature of HTML, a regular expression that does what you ask would be very (very very) long and complicated. Use an HTML parser instead.
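
To illustrate the nesting problem, here is a small sketch of what the obvious pattern does. The class name NestedDivDemo and the shortened sample markup are made up for the example; only System.Text.RegularExpressions is used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDivDemo // hypothetical name, for illustration only
{
    // Naive attempt: lazily match from <div class="row "> up to the nearest </div>.
    public static string NaiveMatch(string html)
    {
        Match m = Regex.Match(html, "<div class=\"row \">.*?</div>", RegexOptions.Singleline);
        return m.Value;
    }

    public static void Main()
    {
        string html =
            "<div class=\"row \"><div class=\"photo\">img</div><div class=\"desc\">text</div></div>";

        // The lazy match stops at the first closing tag, which belongs to the inner "photo" div.
        Console.WriteLine(NaiveMatch(html));
        // prints: <div class="row "><div class="photo">img</div>
    }
}
```

The match ends at the first </div>, so the row is cut off after the photo div. Making the quantifier greedy has the opposite problem: it runs to the last </div> on the page and swallows everything between the rows. A pattern that counts opening and closing tags is possible in .NET with balancing groups, but it quickly becomes hard to read and maintain.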

Upvotes: 15
