Reputation: 1459
I am trying to build a string of the contents of a webpage, without HTML syntax (probably replace it with a space, so words are not all conjoined) or punctuation.
Say you have the code:
<body>
<h1>Content:</h1>
<p>paragraph 1</p>
<p>paragraph 2</p>
<script> alert("blah blah blah"); </script>
This is some text<br />
....and some more
</body>
I want to return the string:
var content = "Content paragraph 1 paragraph 2 this is some text and this is some more";
Any idea how to do this? Thanks.
Upvotes: 2
Views: 8375
Reputation: 2953
You can try the replace call below:
var str = "..your HTML..";
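// The slashes inside the pattern must be escaped, or the regex literal ends early
var content = str.replace(/<\/?[a-zA-Z0-9]+>|<[a-zA-Z0-9]+\s*\/>|\r?\n/g, " ");
For the HTML that you have provided above, this gives roughly the following string in content (runs of extra spaces aside; note that the script content is kept):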
Content: paragraph 1 paragraph 2 alert("blah blah blah"); This is some text ....and some more
Upvotes: 0
Reputation: 147353
There is the W3C DOM 3 Core textContent property, supported by some browsers, and the MS/HTML5 innerText property, supported by others (some support both). The content of the script element is probably unwanted, so a recursive traversal of the relevant part of the DOM tree seems best:
// Get the text within an element
// Doesn't do any normalising, returns a string
// of text as found.
function getTextRecursive(element) {
  var text = [];
  var el, els = element.childNodes;
  for (var i = 0, iLen = els.length; i < iLen; i++) {
    el = els[i];
    // May need to add other node types here
    // Exclude script element content
    if (el.nodeType == 1 && el.tagName && el.tagName.toLowerCase() != 'script') {
      // Recurse by name; arguments.callee throws in strict mode
      text.push(getTextRecursive(el));
    // If working with XML, add nodeType 4 to get text from CDATA nodes
    } else if (el.nodeType == 3) {
      // Deal with extra whitespace and returns in text here.
      text.push(el.data);
    }
  }
  return text.join('');
}
Upvotes: 2
Reputation: 14447
You'll need a striptags function in JavaScript for that, plus a regex to replace consecutive newlines with a single space.
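JavaScript has no built-in striptags, so here is a minimal sketch of what such a helper might look like. The name stripTags is hypothetical, and the approach is regex-based, which is an assumption that the markup is simple: comments, CDATA sections, or a ">" inside an attribute value would trip it up, and a DOM-based approach is safer when a DOM is available.

```javascript
// Hypothetical helper (not a built-in): a regex-based sketch for simple markup.
function stripTags(html) {
  return html
    // Drop script and style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Replace each remaining tag with a space so adjacent words stay separate.
    .replace(/<\/?[a-zA-Z][^>]*>/g, " ")
    // Collapse runs of whitespace, including newlines, into a single space.
    .replace(/\s+/g, " ")
    .trim();
}
```

Applied to the HTML from the question, this should leave the visible text on one line with the script content removed; punctuation is left untouched.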
Upvotes: 0
Reputation: 165941
You can use the innerText property (instead of innerHTML, which returns the HTML tags as well):
var content = document.getElementsByTagName("body")[0].innerText;
However, note that this will also include new lines, so if you are after exactly what you specified in your question, you would need to remove them.
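One way to clean up the extracted string is sketched below. The helper name normalizeText is hypothetical, and it assumes the text is plain ASCII English: every character that is not an ASCII letter, digit, or whitespace is treated as punctuation, so accented letters would be removed as well.

```javascript
// Hypothetical helper: strips punctuation and collapses whitespace.
// Assumes ASCII text; non-ASCII letters are treated as punctuation.
function normalizeText(text) {
  return text
    // Replace anything that is not a letter, digit, or whitespace with a space.
    .replace(/[^A-Za-z0-9\s]/g, " ")
    // Collapse newlines, tabs, and repeated spaces into a single space.
    .replace(/\s+/g, " ")
    .trim();
}
```

In a browser it could be called as normalizeText(document.body.innerText). It does not change letter case, so lowercasing, if wanted, would be a separate step.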
Upvotes: 3