MrMAG

Reputation: 1264

HtmlAgilityPack - Getting rid of ads between HTML comment tags

I need to remove everything between <!-- custom ads --> and <!-- /custom ads --> in this code snippet.

<!-- custom ads -->
<div style="float:left">
  <!-- custom_Forum_Postbit_336x280 -->
  <div id='div-gpt-ad-1526374586789-2' style='width:336px; height:280px;'>
    <script type='text/javascript'>
       googletag.display('div-gpt-ad-1526374586789-2');
    </script>
  </div>
</div>
<div style="float:left; padding-left:20px">
  <!-- custom_Forum_Postbit_336x280_r -->
  <div id='div-gpt-ad-1526374586789-3' style='width:336px; height:280px;'>
    <script type='text/javascript'>
      googletag.display('div-gpt-ad-1526374586789-3');
    </script>
   </div>
</div>
<div class="clear"></div>

 <br>
<!-- /custom ads -->


<!-- google_ad_section_start -->Some Text,<br>
Some More Text...<br>
<!-- google_ad_section_end -->

I can already find the two comments with the XPath //comment()[contains(., 'custom')], but now I'm stuck on how to remove everything in between those "tags".

        foreach (var comment in htmlDoc.DocumentNode.SelectNodes("//comment()[contains(., 'custom')]"))
        {
            MessageBox.Show(comment.OuterHtml);
        }

Any suggestions?

Upvotes: 0

Views: 296

Answers (1)

spender

Reputation: 120480

//find all comment nodes that contain "custom ads"
var nodes = doc.DocumentNode
               .Descendants()
               .OfType<HtmlCommentNode>()
               .Where(c => c.Comment.Contains("custom ads"))
               .ToList();
//create a sequence of pairs of nodes
var nodePairs = nodes
    .Select((node, index) => new {node, index})
    .GroupBy(x => x.index / 2)
    .Select(g => g.ToArray())
    //skip a trailing start comment that has no matching end comment
    .Where(a => a.Length == 2)
    .Select(a => new { startComment = a[0].node, endComment = a[1].node});

foreach (var pair in nodePairs)
{
    var startNode = pair.startComment;
    var endNode = pair.endComment;
    //check they share the same parent or the wheels will fall off
    if(startNode.ParentNode != endNode.ParentNode)
        throw new InvalidOperationException("Start and end comments do not share the same parent node.");
    //iterate all nodes inbetween
    var currentNode = startNode.NextSibling;
    while(currentNode != endNode)
    {
        //currentNode won't have siblings when we trim it from the doc
        //so grab the nextSibling while it's still attached
        var n = currentNode.NextSibling;
        //and cut out currentNode
        currentNode.Remove();
        currentNode = n;
    }
}
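
For anyone who wants to try this end to end, here is a minimal sketch that wraps the same idea in one method, assuming the HtmlAgilityPack NuGet package is referenced. The class name AdStripper and the method name StripCustomAds are hypothetical, made up for this example; the loop steps through the matched comments two at a time instead of building the pairs with GroupBy, which is only a stylistic variation.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AdStripper
{
    // Hypothetical helper (not part of HtmlAgilityPack): removes every sibling
    // node that sits between a "custom ads" start comment and its end comment.
    public static string StripCustomAds(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Materialize the matches first so removing nodes later
        // does not disturb the enumeration.
        var comments = doc.DocumentNode
                          .Descendants()
                          .OfType<HtmlCommentNode>()
                          .Where(c => c.Comment.Contains("custom ads"))
                          .ToList();

        // Walk the comments two at a time: (start, end), (start, end), ...
        // A trailing comment without a partner is left alone.
        for (var i = 0; i + 1 < comments.Count; i += 2)
        {
            var start = comments[i];
            var end = comments[i + 1];

            // Sibling traversal only works if both comments share a parent.
            if (start.ParentNode != end.ParentNode)
                throw new InvalidOperationException(
                    "Start and end comments do not share the same parent node.");

            var current = start.NextSibling;
            while (current != null && current != end)
            {
                // Grab the next sibling before detaching the current node.
                var next = current.NextSibling;
                current.Remove();
                current = next;
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling AdStripper.StripCustomAds(html) on the snippet from the question should leave the two custom ads comments in place with nothing between them, and the google_ad_section text untouched. If the comment markers themselves should go too, call Remove() on start and end after the inner loop.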

Upvotes: 3
