Reputation: 31
I'm trying to scrape a whole div element in C#.
I've tried div class="txt-block"\s*(.+?)(\r\n?|\n)\s*"
But it doesn't capture the whole element :(
Any ideas?
Here is the div. Thanks!
<div class="txt-block" itemprop="creator" itemscope itemtype="http://schema.org/Person">
<h4 class="inline">Writers:</h4>
<a href="/name/nm1318843/?ref_=tt_ov_wr" itemprop='url'><span class="itemprop" itemprop="name">Mark Fergus</span></a> (screenplay),
<a href="/name/nm1319757/?ref_=tt_ov_wr" itemprop='url'><span class="itemprop"
itemprop="name">Hawk Ostby</span></a> (screenplay), <a href="fullcredits?ref_=tt_ov_wr#writers" >6 more credits</a> »
</div>
Upvotes: 1
Views: 1864
Reputation: 6258
First, the simplest one:
(?s)<div.*?>(.*?)</div>
Need a div with a particular class?
(?s)<div[^>]*?class="txt-block"[^>]*?>(.*?)</div>
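For what it's worth, here is a minimal sketch of how that pattern might be called from C#. The DivScraper class and ExtractTxtBlock method are names made up for this illustration; Regex.Match and Groups are the standard System.Text.RegularExpressions API. The inline (?s) is what lets . cross line breaks, which the pattern in the question does not do.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration, not part of any library.
public static class DivScraper
{
    // (?s) turns on Singleline mode: '.' also matches '\n', so the match can span lines.
    private static readonly Regex TxtBlock = new Regex(
        @"(?s)<div[^>]*?class=""txt-block""[^>]*?>(.*?)</div>");

    // Returns the inner HTML of the first txt-block div, or null if none is found.
    // Assumes the div contains no nested <div> elements.
    public static string ExtractTxtBlock(string html)
    {
        Match m = TxtBlock.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling DivScraper.ExtractTxtBlock(pageSource) on the downloaded page should give you everything between the opening tag and the first closing div tag.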
Want to save CPU and avoid unnecessary backtracking?
<div[^>]*?class="txt-block"[^>]*?>(([^<]*(?(?!</div>)<))*)</div>
The above assumes you don't have nested div elements. Nesting is where the usual advice not to parse HTML with regex really comes into play, unless you are using C#/.NET, whose regex engine supports balancing groups. In that case you can do this:
(?xm)
(?>
<(?<Tagname>div)[^>]*?class="txt-block"[^>]*>
)
(?(Tagname)
(
</(?(?!\k'Tagname')(?<-Tagname>))*\k'Tagname'>(?<-Tagname>)
|
(?>
<(?<Tagname>[a-z][^\s>]*)[^>]*>
)
|
[^<]+
)+?
(?(Tagname)(?!))
)
Or, the single line version:
(?m)(?><(?<Tagname>div)[^>]*?class="txt-block"[^>]*>)(?(Tagname)(</(?(?!\k'Tagname')(?<-Tagname>))*\k'Tagname'>(?<-Tagname>)|(?><(?<Tagname>[a-z][^\s>]*)[^>]*>)|[^<]+)+?(?(Tagname)(?!)))
The most complex example I posted won't work in RegexBuddy, but it will work in any version of the .NET Framework. RegexBuddy doesn't support balancing groups, which are specific to the .NET regex flavor.
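To make the balancing-group idea easier to follow, here is a stripped-down sketch. It is not the pattern above: it only tracks div tags rather than every tag name, and the NestedDivScraper class, the ExtractTxtBlock method and the Depth group name are all made up for the illustration. Each nested opening div pushes an empty capture onto Depth, each nested closing div pops one, and the final conditional rejects the match if anything is left on the stack.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; Depth is an arbitrary group name chosen for this sketch.
public static class NestedDivScraper
{
    private static readonly Regex TxtBlock = new Regex(@"
        <div\b[^>]*\bclass=""txt-block""[^>]*>   # opening tag of the target div
        (?>
            <div\b[^>]*> (?<Depth>)    # nested opening div: push onto the Depth stack
          | </div> (?<-Depth>)         # nested closing div: pop, fails if the stack is empty
          | (?!</?div\b) .             # any other single character
        )*
        (?(Depth)(?!))                 # fail if a nested div is still open
        </div>                         # the closing tag that balances the target div
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the outer HTML of the first txt-block div, nested divs included,
    // or null if no balanced match is found.
    public static string ExtractTxtBlock(string html)
    {
        Match m = TxtBlock.Match(html);
        return m.Success ? m.Value : null;
    }
}
```

Because the pop fails on an empty stack, the loop stops right before the closing tag that belongs to the outer div, so sibling divs after it are not swallowed.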
Upvotes: 6
Reputation: 469
Parsing HTML with regex is not a good idea. Try an HTML parsing library for C# instead.
After a quick search I came up with this library: http://htmlagilitypack.codeplex.com/ It seems to have all the functionality you need.
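A rough sketch of what that could look like, assuming the Html Agility Pack package is installed and referenced in your project. The AgilityScraper class and ExtractTxtBlock method are names invented for this example, and the XPath assumes the class attribute is exactly "txt-block", as in the snippet from the question.

```csharp
using System;
using HtmlAgilityPack; // third-party package, must be installed separately

// Hypothetical helper for illustration.
public static class AgilityScraper
{
    // Returns the outer HTML of the first div whose class attribute is exactly
    // "txt-block", or null if there is no such div.
    public static string ExtractTxtBlock(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='txt-block']");
        return node?.OuterHtml;
    }
}
```

Since the library builds a document tree, nested divs inside the block are not a problem, and you can use node.InnerText instead of OuterHtml if you only want the text.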
Upvotes: 0