Reputation: 855
I am trying to get the inner text of an HTML string using a JS function (the string is passed as an argument). Here is the code:
function extractContent(value) {
  var content_holder = "";
  for (var i = 0; i < value.length; i++) {
    if (value.charAt(i) === '>') {
      continue;
      while (value.charAt(i) != '<') {
        content_holder += value.charAt(i);
      }
    }
  }
  console.log(content_holder);
}
extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");
The problem is that nothing gets printed on the console (*content_holder* stays empty). I think the problem is caused by the === operator.
Upvotes: 73
Views: 191172
Reputation: 2298
Based on Rick Hitchcock's answer and KevBot's, this is the best way I found to do it:
function getTextLoop(element: HTMLElement | ChildNode) {
  const texts = [];
  Array.from(element.childNodes).forEach((node) => {
    if (node.nodeType === 3) {
      // nodeType 3 is a text node: keep its trimmed text
      texts.push(node.textContent.trim());
    } else {
      // any other node: collect the text of its descendants
      texts.push(...getTextLoop(node));
    }
  });
  return texts;
}
function innerText(element: HTMLElement) {
  return getTextLoop(element).join(" ");
}
export function extractContent(s: string, space?: boolean) {
  var span = document.createElement("span");
  span.innerHTML = s;
  if (space) {
    // assign as text, not as HTML, so decoded entities such as &lt; are not parsed again
    span.textContent = innerText(span);
  }
  return [span.textContent || span.innerText].toString().replace(/ +/g, " ");
}
Example:
extractContent("<div>foo<div>bar</div></div>", true); // foo bar
Upvotes: 0
Reputation: 1145
Use the match() function to pull out the HTML tags:
const text = `<div>Hello World</div>`;
console.log(text.match(/<[^>]*?>/g));
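Note that match() returns the tags themselves, not the text between them. If the text is what you are after, one possible variation is to split on the same pattern instead. This is only a sketch: textBetweenTags is a name made up for the illustration, and the pattern assumes that no > character appears inside an attribute value.

```javascript
// Hypothetical helper (not part of the answer above): splits the string at
// every tag, so what is left over is the text between the tags.
function textBetweenTags(html) {
  return html
    .split(/<[^>]*?>/)                  // cut the string at each tag
    .map((part) => part.trim())         // tidy surrounding whitespace
    .filter((part) => part.length > 0); // drop the empty pieces
}

console.log(textBetweenTags("<div>Hello World</div>"));
console.log(textBetweenTags("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
```

The first call logs an array holding "Hello World", the second an array holding "Hello" and "W3C".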
Upvotes: 0
Reputation: 15830
This uses the jsdom library, since Node.js doesn't have the DOM features a browser has.
import * as jsdom from "jsdom";
const html = "<h1>Testing</h1>";
// textContent is null on the document node itself, so read it from the body
const text = new jsdom.JSDOM(html).window.document.body.textContent;
console.log(text);
Upvotes: 3
Reputation: 45
With jQuery, you can pass a comma-separated list of tags as the selector and collect the text of every match:
var readableText = [];
$("p, h1, h2, h3, h4, h5, h6").each(function () {
  readableText.push($(this).text().trim());
});
console.log(readableText.join(' '));
Upvotes: -1
Reputation:
One line (more precisely, one statement) version:
function extractContent(html) {
  return new DOMParser()
    .parseFromString(html, "text/html")
    .documentElement.textContent;
}
Upvotes: 87
Reputation: 1067
textContent is a very good technique for achieving the desired result, but sometimes we don't want to load the DOM. A simple workaround is the following regular expression:
let htmlString = "<p>Hello</p><a href='http://w3c.org'>W3C</a>"
let plainText = htmlString.replace(/<[^>]+>/g, '');
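One difference from textContent is that the regex leaves character references such as &amp;amp; untouched. If that matters, a rough sketch like the following could decode the most common ones after stripping the tags. The names htmlToPlainText and NAMED_ENTITIES are invented for this example, and the entity table is deliberately tiny, so treat it as an assumption-laden illustration rather than a full HTML decoder.

```javascript
// Assumption: only these named entities matter; real HTML defines many more.
const NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00A0" };

// Hypothetical helper: strips tags with the regex from the answer, then
// decodes a few named entities and numeric references (&#39; or &#x27;).
function htmlToPlainText(htmlString) {
  return htmlString
    .replace(/<[^>]+>/g, "")
    .replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
      if (body[0] === "#") {
        const isHex = body[1] === "x" || body[1] === "X";
        const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        // leave out-of-range references as they are instead of throwing
        return code <= 0x10ffff ? String.fromCodePoint(code) : match;
      }
      // unknown names are returned unchanged
      return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
        ? NAMED_ENTITIES[body]
        : match;
    });
}

console.log(htmlToPlainText("<p>Tom &amp; Jerry &lt;3 &#39;cats&#39;</p>"));
```

Decoding happens in a single pass after the tags are gone, so text like &amp;lt;b&amp;gt; ends up as the literal characters and is not stripped as a tag.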
Upvotes: 43
Reputation: 35680
Create an element, store the HTML in it, and get its textContent:
function extractContent(s) {
  var span = document.createElement('span');
  span.innerHTML = s;
  return span.textContent || span.innerText;
};
alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
Here's a version that allows you to have spaces between nodes, although you'd probably want that for block-level elements only:
function extractContent(s, space) {
  var span = document.createElement('span');
  span.innerHTML = s;
  if (space) {
    var children = span.querySelectorAll('*');
    for (var i = 0; i < children.length; i++) {
      if (children[i].textContent)
        children[i].textContent += ' ';
      else
        children[i].innerText += ' ';
    }
  }
  return [span.textContent || span.innerText].toString().replace(/ +/g, ' ');
};
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>. Nice to <em>see</em><strong><em>you!</em></strong>"));
console.log(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>. Nice to <em>see</em><strong><em>you!</em></strong>", true));
Upvotes: 131
Reputation: 1456
Try this:
<!DOCTYPE html>
<html>
<body>
<script type="text/javascript">
  function extractContent(value) {
    var div = document.createElement('div');
    div.innerHTML = value;
    var text = div.textContent;
    return text;
  }
  window.onload = function () {
    alert(extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>"));
  };
</script>
</body>
</html>
Upvotes: 2
Reputation: 537
Use this regex to remove the HTML tags and keep only the inner text. For the string in the question it returns just HelloW3C:
var content_holder = value.replace(/<(?:.|\n)*?>/gm, '');
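Because the tags are simply deleted, text from neighbouring elements runs together. If a separator is wanted, one possible variation is to turn each tag into a space and then collapse the whitespace. The function name extractContentSpaced is hypothetical, and the sketch assumes that a space between every pair of elements is acceptable, including around inline tags such as em.

```javascript
// Hypothetical helper: same regex as above, but each tag becomes a space
// so that text from adjacent elements does not run together.
function extractContentSpaced(value) {
  return value
    .replace(/<(?:.|\n)*?>/gm, " ") // every tag turns into one space
    .replace(/\s+/g, " ")           // collapse runs of whitespace
    .trim();                        // drop the leading and trailing space
}

console.log(extractContentSpaced("<p>Hello</p><a href='http://w3c.org'>W3C</a>")); // Hello W3C
```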
Upvotes: 8
Reputation: 1958
You could temporarily write it out to a block-level element that is positioned off the page, something like this:
HTML:
<div id="tmp" style="position:absolute;top:-400px;left:-400px;">
</div>
JavaScript:
<script type="text/javascript">
  function extractContent(value) {
    var div = document.getElementById('tmp');
    div.innerHTML = value;
    // logs the inner HTML of the first child only (the <p> element)
    console.log(div.children[0].innerHTML);
  }
  extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");
</script>
Upvotes: 0
Reputation: 83
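You need an array to hold the values. The continue also has to go, because it skips the inner loop entirely, and the inner loop must advance i itself or it never ends:
function extractContent(value) {
  var content_holder = new Array();
  for (var i = 0; i < value.length; i++) {
    if (value.charAt(i) === '>') {
      i++; // step past the '>' that closes the tag
      // collect characters up to the next '<' or the end of the string
      while (i < value.length && value.charAt(i) !== '<') {
        content_holder.push(value.charAt(i));
        i++;
      }
    }
  }
  console.log(content_holder.join('')); // HelloW3C
}
extractContent("<p>Hello</p><a href='http://w3c.org'>W3C</a>");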
Upvotes: -2