Reputation: 31

Trying to get a whole <div> with regex in c#

I'm trying to scrape a whole div element in C#.

I've tried div class="txt-block"\s*(.+?)(\r\n?|\n)\s*" but it doesn't capture the whole element. Any ideas? Here is the div. Thanks!

    <div class="txt-block" itemprop="creator" itemscope itemtype="http://schema.org/Person"> 
    <h4 class="inline">Writers:</h4>
    <a href="/name/nm1318843/?ref_=tt_ov_wr" itemprop='url'><span class="itemprop"    itemprop="name">Mark Fergus</span></a>               (screenplay), 
    <a href="/name/nm1319757/?ref_=tt_ov_wr" itemprop='url'><span class="itemprop"         
    itemprop="name">Hawk Ostby</span></a>               (screenplay), <a href="fullcredits?ref_=tt_ov_wr#writers" >6 more credits</a>&nbsp;&raquo;
    </div>

Upvotes: 1

Answers (2)

Suamere

Reputation: 6258

First, the simplest one:

(?s)<div.*?>(.*?)</div>

Need a div with a particular class?

(?s)<div[^>]*?class="txt-block"[^>]*?>(.*?)</div>
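For reference, here is a minimal sketch of how the class-specific pattern above might be used from C#. The class name DivScraper, the method name ExtractTxtBlock, and the shortened sample markup are made up for illustration; only the pattern itself comes from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivScraper
{
    // Pattern from the answer; (?s) makes '.' match newlines as well,
    // so the lazy (.*?) can run across lines up to the first </div>.
    private const string Pattern =
        @"(?s)<div[^>]*?class=""txt-block""[^>]*?>(.*?)</div>";

    // Hypothetical helper: returns the whole <div>...</div> text,
    // or an empty string when nothing matches.
    public static string ExtractTxtBlock(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // Shortened, made-up version of the markup in the question.
        string html =
            "<p>before</p>\n" +
            "<div class=\"txt-block\" itemprop=\"creator\">\n" +
            "<h4 class=\"inline\">Writers:</h4>\n" +
            "<a href=\"/name/nm1318843/\">Mark Fergus</a> (screenplay)\n" +
            "</div>\n" +
            "<p>after</p>";

        Console.WriteLine(ExtractTxtBlock(html));
    }
}
```

m.Value is the full element including the tags; m.Groups[1].Value would give only the inner markup.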

Want to save CPU and avoid unnecessary backtracking?

<div[^>]*?class="txt-block"[^>]*?>(([^<]*(?(?!</div>)<))*)</div>

The patterns above assume the div contains no nested div elements. Nested tags are where the usual advice against parsing HTML with regex really applies, unless you are using .NET, whose regex engine supports balancing groups. In that case you can do this:

(?xm)
(?>
    <(?<Tagname>div)[^>]*?class="txt-block"[^>]*>
)
(?(Tagname)
    (
        </(?(?!\k'Tagname')(?<-Tagname>))*\k'Tagname'>(?<-Tagname>)
    |
        (?>
            <(?<Tagname>[a-z][^\s>]*)[^>]*>
        )
    |
        [^<]+
    )+?
    (?(Tagname)(?!))
)

Or the single-line version:

(?m)(?><(?<Tagname>div)[^>]*?class="txt-block"[^>]*>)(?(Tagname)(</(?(?!\k'Tagname')(?<-Tagname>))*\k'Tagname'>(?<-Tagname>)|(?><(?<Tagname>[a-z][^\s>]*)[^>]*>)|[^<]+)+?(?(Tagname)(?!)))
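A rough sketch of how the single-line pattern could be called from C# on markup that contains a nested div. The names NestedDivScraper and ExtractTxtBlock and the sample input are hypothetical; the pattern is copied from above with only the quotes doubled for the verbatim string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDivScraper
{
    // Balancing-group pattern from the answer. Each opening tag pushes its
    // name onto the Tagname stack, each closing tag pops it, and the final
    // (?(Tagname)(?!)) only lets the match end once the stack is empty.
    private const string Pattern =
        @"(?m)(?><(?<Tagname>div)[^>]*?class=""txt-block""[^>]*>)" +
        @"(?(Tagname)(</(?(?!\k'Tagname')(?<-Tagname>))*\k'Tagname'>(?<-Tagname>)" +
        @"|(?><(?<Tagname>[a-z][^\s>]*)[^>]*>)|[^<]+)+?(?(Tagname)(?!)))";

    // Hypothetical helper: returns the outer div including any nested tags,
    // or an empty string when nothing matches.
    public static string ExtractTxtBlock(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        // Made-up input with a div nested inside the target div.
        string html =
            "<div class=\"txt-block\"><div>inner</div> tail</div><p>after</p>";

        Console.WriteLine(ExtractTxtBlock(html));
    }
}
```

With the simpler patterns the match would stop at the first closing div tag, which here belongs to the inner div.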

The most complex example above won't work in RegexBuddy, but it will work in any version of the .NET Framework. RegexBuddy doesn't support balancing groups, which are specific to the .NET regex flavor.

Upvotes: 6

Jim

Reputation: 469

Parsing HTML with regex is not a good idea. Try finding a library for parsing HTML in C#.

After a quick search I found the Html Agility Pack (http://htmlagilitypack.codeplex.com/), which seems to have all the functionality you need.
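As a rough sketch of what that might look like, assuming the HtmlAgilityPack NuGet package is referenced in the project: the class name AgilityPackScraper, the method name ExtractTxtBlock, and the sample markup are made up for illustration, and the XPath assumes the class attribute is exactly "txt-block".

```csharp
using System;
using HtmlAgilityPack; // third-party package, not part of the .NET base library

public static class AgilityPackScraper
{
    // Hypothetical helper: returns the outer HTML of the first
    // <div class="txt-block">, or an empty string if there is none.
    public static string ExtractTxtBlock(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode node =
            doc.DocumentNode.SelectSingleNode("//div[@class='txt-block']");

        return node == null ? string.Empty : node.OuterHtml;
    }

    public static void Main()
    {
        // Made-up input with a div nested inside the target div.
        string html =
            "<div class=\"txt-block\"><div>inner</div> tail</div><p>after</p>";

        Console.WriteLine(ExtractTxtBlock(html));
    }
}
```

Because the library builds a document tree, nested divs need no special handling, and node.InnerText would give just the text if the tags are not wanted.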

Upvotes: 0
