Reputation: 1459

javascript HTML from document.body.innerHTML

I am trying to build a string from the contents of a webpage, without HTML tags (probably replacing them with a space, so that words do not run together) and without punctuation.

Say you have the code:

    <body>
    <h1>Content:</h1>
    <p>paragraph 1</p>
    <p>paragraph 2</p>

    <script> alert("blah blah blah"); </script>

    This is some text<br />
    ....and some more
    </body>

I want to return the string:

    var content = "Content paragraph 1 paragraph 2 this is some text and this is some more";

Any idea how to do this? Thanks.

Upvotes: 2

Answers (4)

Jophin Joseph

Reputation: 2953

You can try using the replace statement below:

var str = "..your HTML..";
var content = str.replace(/<\/?[a-zA-Z0-9]+>|<[a-zA-Z0-9]+\s*\/>|\r?\n/g, " ");

For the HTML that you have provided above, this will give you the following string in content:

   Content:   paragraph 1   paragraph 2    alert("blah blah blah");   This is some text  ....and some more

Upvotes: 0

RobG

Reputation: 147353

There is the W3C DOM 3 Core textContent property supported by some browsers, or the MS/HTML5 innerText property supported by other browsers (some support both). Likely the content of the script element is unwanted, so a recursive traverse of the related part of the DOM tree seems best:

// Get the text within an element
// Doesn't do any normalising, returns a string
// of text as found.
function getTextRecursive(element) {
  var text = [];
  var el, els = element.childNodes;

  for (var i=0, iLen=els.length; i<iLen; i++) {
    el = els[i];

    // May need to add other node types here
    // Exclude script element content
    if (el.nodeType == 1 && el.tagName && el.tagName.toLowerCase() != 'script') {
      // Recurse by name; arguments.callee throws in strict mode
      text.push(getTextRecursive(el));

    // If working with XML, add nodeType 4 to get text from CDATA nodes
    } else if (el.nodeType == 3) {

      // Deal with extra whitespace and returns in text here.
      text.push(el.data);
    }
  }
  return text.join('');
}

Upvotes: 2

ChrisR

Reputation: 14447

You'll need a striptags function in JavaScript for that, plus a regex to replace consecutive newlines with a single space.
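
JavaScript has no built-in striptags, so the following is only a minimal sketch of what such a helper might look like. The name stripTags is hypothetical, and the regex approach assumes reasonably well-formed markup; it is not a full HTML parser and can be fooled by things like a ">" inside an attribute value.

```javascript
// Hypothetical helper (not a built-in): strips tags from an HTML string.
function stripTags(html) {
  return html
    // Drop <script> and <style> elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Replace every remaining tag with a space so words do not run together.
    .replace(/<[^>]*>/g, ' ')
    // Collapse runs of whitespace (including newlines) into a single space.
    .replace(/\s+/g, ' ')
    .trim();
}
```

In a browser it could be called as stripTags(document.body.innerHTML). Punctuation is left in place by this sketch.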

Upvotes: 0

James Allardice

Reputation: 165941

You can use the innerText property (instead of innerHTML, which returns the HTML tags as well):

var content = document.getElementsByTagName("body")[0].innerText;

However, note that this will also include new lines, so if you are after exactly what you specified in your question, you would need to remove them.
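
As a rough sketch of that clean-up step, the newlines and the punctuation mentioned in the question could be removed with a couple of replace calls. The function name normaliseText is made up for illustration, and the pattern assumes ASCII text, since \w does not match accented or non-Latin letters.

```javascript
// Hypothetical helper: flattens text to words separated by single spaces.
function normaliseText(text) {
  return text
    // Turn anything that is not a letter, digit or whitespace into a space.
    // \w also matches "_", so underscores are handled explicitly.
    .replace(/[^\w\s]|_/g, ' ')
    // Collapse newlines, tabs and repeated spaces into one space.
    .replace(/\s+/g, ' ')
    .trim();
}
```

In a browser this might be used as normaliseText(document.body.innerText).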

Upvotes: 3
