semicodin

Reputation: 19

Strip all ADVERTISING CODE from my HTML?

My HTML skills are only slightly above newbie level, though my CSS is improving daily, so I don't even know if this can be done. Although I have no Python, PHP, Ruby, JavaScript, Perl, Fortran buzzer! (just want to make sure you're still awake, big guy), I am willing to learn. The slice below is the first 970 characters, roughly 0.27 percent, of the 365,937 characters in a single style block alone. It is these and other Wall of Advertising Code blocks I aspire to delete:

<style type="text/css">#Ad2, #AdText, #Ad_Top, #Adbanner, #Adfox_Banner, #Ads, #AdvertFieldBottom, #AdvertFieldCenter, #AdvertFieldTop, #Advertisement, #AdvertisingTopLine, #BanHolder28-1, #BannerGBottom, #BannerGCenter, #BannerGIMG, #BannerGTop, #BannerH2Left, #BannerHIMG, #BannerHLeft, #BannerUnderBroChat, #JaboxAdBarOuter, #METABAR_IFRAME, #MarketGidComposite1001, #PopUpWnd, #PopWin, #PopWin_popupsu_notds, #RichBanner_center, #__adIframe, #ad-200, #ad-slides, #ad2, #ad4, #ad7, #adHeadBanner, #adL, #adP, #adWrapper, #ad_help_link, #ad_hide_mask_ad_0, #ad_hide_mask_ad_1, #adbns, #adf_notifiers_wrap, #adsCSS, #advRightBox, #advbroker_place_1, #advbroker_place_10, #advbroker_place_2, #advbroker_place_3, #advbroker_place_4, #advbroker_place_5 { display: none!important; }
#advbroker_place_6, #advbroker_place_7, #advbroker_place_8, #advbroker_place_9, #advertbox, #advertising_floater, #advertisment, #advrich, #advunder-top, #adzerk3, #app-banners, . . .</style>

I frequently save HTML pages for my own private reference and I'd like to know if there are any offline¹ widgets/ apps/ macros/ techniques that I could use to strip

  1. the file's advertising code, and
  2. all non-content data code (scripts, forms, events etc.)

I'd like to keep the visual style of the author's page but remove the bloat, and I figure if the towering level of talent on Stack Overflow can't find a solution then nobody can. I have rudimentary knowledge of regular expressions and, with the exception of Notepad++, I am a regular user of the assets below:

Can it be done? Thanks everyone. :)

¹for privacy reasons I'd like to avoid an online service

Upvotes: 1

Views: 1456

Answers (3)

semicodin

Reputation: 19

Okay, this is crude, but as Wild Beard mentioned, there just isn't an easy way to get rid of this ad crap. Use a fixed-pitch (monospace) font and a robust text editor with line-numbering options (I did this in TextPad, but I'm pretty sure Don Ho's free Notepad++ could do it as well).

  1. SAVE A BACKUP OF YOUR ORIGINAL!
  2. remove all word wrapping
  3. align all lines of text to the left margin
  4. eliminate all double-or-more vertical spacing

You should now have a large block of text, left-aligned, and single-spaced

  5. at the first character position of each line insert a line number, followed by a TAB
  6. zero-fill the number column so the digits align

Zero-filling matters because a sort on the first character would otherwise group line 5 with line 50001.

  7. visually scroll for the really lengthy lines and begin to experiment with sorting on their positions

What you're doing is grabbing the longest of the advertising lines and isolating them for deletion. Be prepared to do this more than once. And don't sweat getting the document back to its original order. That's why you numbered the lines.
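The number, sort, delete, re-sort routine above can also be scripted. Here is a minimal sketch in JavaScript, assuming the page is already loaded as a string. `dropLongLines` and the `maxLength` cutoff are made-up names for illustration, and the length cutoff stands in for the "scroll and eyeball it" step, so expect to tune it per file.

```javascript
// Sketch of the number / sort / delete / re-sort routine.
// maxLength is a hypothetical cutoff, not a value from the answer.
function dropLongLines(text, maxLength) {
  // Left-align every line and drop blank lines (single spacing).
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trimStart())
    .filter((line) => line !== '');
  const width = String(lines.length).length;

  // Prefix each line with a zero-filled line number and a TAB.
  const numbered = lines.map(
    (line, i) => String(i + 1).padStart(width, '0') + '\t' + line
  );

  // Sort longest first so the ad walls float to the top.
  const byLength = [...numbered].sort((a, b) => b.length - a.length);

  // Delete rows above the cutoff (number column and TAB not counted).
  const kept = byLength.filter((row) => row.length - width - 1 <= maxLength);

  // A plain string sort restores the original order because of the zero fill.
  kept.sort();

  // Strip the number column again.
  return kept.map((row) => row.slice(width + 1)).join('\n');
}
```

Like the manual version, this throws away any long line, ad or not, so keep the backup and check the result in a browser.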

Upvotes: 1

ESP32

Reputation: 8728

If you find these strange style definitions inside a shadow root in your browser's inspector, they did not come from the page itself: this CSS is added dynamically to every website by the AdGuard AdBlocker extension. The tool sets all kinds of "#banner...", "#ad..." and similar selectors to "display: none !important".

https://chrome.google.com/webstore/detail/adguard-adblocker/
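If the injected rules do end up in a saved file, one way to drop them is to remove every `<style>` block that contains nothing but "display: none !important" rules. This is a rough sketch under that assumption. `stripHideOnlyStyles` is a hypothetical helper, and the regexes are a heuristic, not a CSS parser.

```javascript
// Heuristic sketch: remove <style> blocks made up solely of
// "display: none !important" rules; leave every other block untouched.
function stripHideOnlyStyles(html) {
  const styleBlock = /<style\b[^>]*>([\s\S]*?)<\/style>/gi;
  const hideRule = /\{\s*display\s*:\s*none\s*!\s*important\s*;?\s*\}/gi;

  return html.replace(styleBlock, (block, css) => {
    const allRules = css.match(/\{[^}]*\}/g) || [];
    const hideRules = css.match(hideRule) || [];
    // Only drop the block when it has rules and all of them are hide rules.
    const hideOnly = allRules.length > 0 && allRules.length === hideRules.length;
    return hideOnly ? '' : block;
  });
}
```

A block that mixes hide rules with real styling is kept as-is, which errs on the side of preserving the author's layout.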

Upvotes: 0

Wild Beard

Reputation: 2927

Here is a simple proof of concept. You'll still need to work out how to read the file and write it back after removing the elements or styles.

However, like I mentioned in my comment, this will match #additional-info as well. I did add a check to see if the element was an iframe which should narrow down errors a bit.

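var matched_classes = [],
    regex = /#ad[\w-]+/gi, // [\w-] so ids such as #ad-200 match in full
    style = document.querySelectorAll('style');

// Collect matches from every <style> block; match() returns null when
// nothing matches, so fall back to an empty array.
style.forEach(function(item) {
  matched_classes = matched_classes.concat(item.innerHTML.match(regex) || []);
});

// Remove only iframes whose id appears in the collected selectors.
matched_classes.forEach(function(item) {
  var el = document.getElementById(item.replace('#', ''));
  if ( el != null && el.nodeName === 'IFRAME' ) {
    el.parentElement.removeChild(el);
  }
});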
<style type="text/css">#Ad2, #AdText, #Ad_Top, #Adbanner</style>
<iframe id="Ad2" src="https://www.w3schools.com">

</iframe>

<div id="AdText">Something not removed hopefully.</div>
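To see the false-positive problem without a browser, the selector matching can be tried on a plain string. This is only an illustration: `extractAdIds` is a made-up helper, and it assumes a pattern that also accepts hyphens in ids.

```javascript
// Sketch: pull "#ad..." id selectors out of CSS text.
function extractAdIds(css) {
  // match() returns null when nothing matches, hence the fallback.
  const matches = css.match(/#ad[\w-]+/gi) || [];
  // Strip the leading "#" and de-duplicate.
  return [...new Set(matches.map((selector) => selector.slice(1)))];
}

const ids = extractAdIds(
  '#Ad2, #AdText { display: none!important; } #additional-info { color: red; }'
);
console.log(ids); // [ 'Ad2', 'AdText', 'additional-info' ]
```

The last entry is not an ad at all, which is why a second check such as the iframe test is worth keeping.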

Edit

As you mentioned in your comment, you have no idea how to implement this. There's no simple and easy way to do it. You can read up on how to create files with JavaScript, but JavaScript likely isn't going to be your best bet. From the list of languages in your question, Python may be your best option; sadly, I don't know Python.

You could copy the code I've created and paste it at the bottom of your files, open the file in your browser, copy the resulting markup from the developer tools (View Source shows the original file, not the modified DOM), and save that as a new file; it should remove any iframe element whose id matches an "#ad..." selector in a <style> tag. That's a bit tedious, but for someone without any experience it may be the best place to start, short of writing out the entire solution for you.

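<script>
var matched_classes = [],
    regex = /#ad[\w-]+/gi, // [\w-] so ids such as #ad-200 match in full
    style = document.querySelectorAll('style');

// Collect matches from every <style> block; match() returns null when
// nothing matches, so fall back to an empty array.
style.forEach(function(item) {
  matched_classes = matched_classes.concat(item.innerHTML.match(regex) || []);
});

// Remove only iframes whose id appears in the collected selectors.
matched_classes.forEach(function(item) {
  var el = document.getElementById(item.replace('#', ''));
  if ( el != null && el.nodeName === 'IFRAME' ) {
    el.parentElement.removeChild(el);
  }
});
</script>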
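For the "scripts, events etc." half of the question, the same idea can be applied to the file's text instead of the live page. This is a rough sketch only: `stripNonContent` is a hypothetical helper, regexes are not a real HTML parser, and it leaves forms alone on the assumption that some pages wrap their whole content in one.

```javascript
// Rough sketch: regex-based cleanup of an HTML string.
function stripNonContent(html) {
  return html
    // Drop <script>...</script> blocks, contents included.
    .replace(/<script\b[\s\S]*?<\/script\s*>/gi, '')
    // Drop <iframe>...</iframe> blocks, a common ad container.
    .replace(/<iframe\b[\s\S]*?<\/iframe\s*>/gi, '')
    // Drop inline event handlers such as onclick="..." or onload='...'.
    .replace(/\s+on[a-z]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi, '');
}
```

Under Node.js this could be fed with `fs.readFileSync('page.html', 'utf8')` and the result saved with `fs.writeFileSync('page.clean.html', cleaned)`; both file names are placeholders. The handler pattern can also hit ordinary text that happens to look like `on...=`, so compare the output against the backup.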

Upvotes: 0
