Yogesh Unavane

Reputation: 265

Remove all text content from a web page, keeping only the HTML tags?

I need to strip all text content from an HTML file, keeping only the HTML tags.

Can this be done with a regular expression or with JavaScript?

BEFORE:

<html>
<head>
<title>Ask a Question - Stack Overflow</title>
<link rel="shortcut icon" href="//cdn.sstatic.net/stackoverflow/img/favicon.ico">
<script type="text/javascript">
document.write("Code remains un-touched");
</script>
</head>
<body class="ask-page new-topbar">
<div id="first">ONE</div>
<div id="sec">TWO</div>
<div id="third">THREE</div>
</body>
</html>

AFTER:

<html>
<head>
<title></title>
<link rel="shortcut icon" href="//cdn.sstatic.net/stackoverflow/img/favicon.ico">
<script type="text/javascript">
document.write("Code remains un-touched");
</script>
</head>
<body class="ask-page new-topbar">
<div id="first"></div>
<div id="sec"></div>
<div id="third"></div>
</body>
</html>

UPDATE: I need to keep working with the HTML tags afterwards. After stripping the text content, the remaining HTML should be displayed. In the end, I am interested in the HTML source code.

Upvotes: 0

Views: 67

Answers (2)

Yoshi

Reputation: 54649

A simple recursive function would work:

(function removeTextNodes(el) {
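  // copy the live childNodes list into an array so removals don't disturb the iteration
  Array.apply([], el.childNodes).forEach(function (child) {
    // nodeType 3 is a text node; text inside <script> is left alone
    if (child.nodeType === 3 && el.nodeName !== 'SCRIPT') {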
      // remove the text node
      el.removeChild(child);
    }
    else if (child.nodeType === 1) {
      // call recursive for child nodes
      removeTextNodes(child);
    }
  });
})(document.documentElement);

Quoting Amadan: just use document.documentElement.outerHTML to get the html as a string.
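
For reuse outside a one-off snippet, the same idea can be written as a named function. This is only a sketch of the approach above, not a different method: the names stripTextNodes and skipTags are made up here, and the default skip list holds only SCRIPT, as in the code above. Passing something like ['SCRIPT', 'STYLE'] would be one way to keep inline CSS as well, if that is wanted.

```javascript
// Hypothetical helper (the names stripTextNodes and skipTags are not from the answers).
// Removes every text node under `root`, except text that sits directly inside
// an element whose tag name is listed in `skipTags`. Returns the number removed.
function stripTextNodes(root, skipTags) {
  var skip = skipTags || ['SCRIPT'];
  var removed = 0;

  // Copy childNodes first: in a browser it is a live list that shrinks on removal.
  Array.prototype.slice.call(root.childNodes).forEach(function (child) {
    if (child.nodeType === 3) { // 3 === text node
      // nodeName is upper case in HTML documents; normalise in case it is not.
      if (skip.indexOf(String(root.nodeName).toUpperCase()) === -1) {
        root.removeChild(child);
        removed += 1;
      }
    } else if (child.nodeType === 1) { // 1 === element node
      removed += stripTextNodes(child, skip);
    }
  });

  return removed;
}
```

In a browser you would call stripTextNodes(document.documentElement) and then read document.documentElement.outerHTML. Note that whitespace-only text nodes between tags are removed too, so the resulting markup loses most of its line breaks.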

Upvotes: 3

Amadan

Reputation: 198314

I'm thinking something like this should work:

$('*').each(function() {
  $(this).contents().filter(function() {
    return this.nodeType == 3 && this.parentNode.nodeName != 'SCRIPT';
  }).remove();
});

Iterate over all elements, see all their child nodes, if they're text nodes and not inside script, kill 'em.

You can test on this very page :P

(Yoshi's jQueryless script is faster, but this was shorter to write :P )

EDIT: nodeName is in caps. Oops.

EDIT for OP's edit: This will subsequently fetch the source code:

$('html')[0].outerHTML

and you can display it using:

$('body').text($('html')[0].outerHTML)

EDIT again: If you want it jQueryless, you can use document.documentElement.outerHTML instead (which is both faster and nicer). It works with Yoshi's solution, too.

Upvotes: 2
