Ostap Hnatyuk

Reputation: 1276

JavaScript troubles extracting text from HTML

I wrote this function to try to extract the text from a page.

<script type = "text/javascript">
function extractText(node){
    var all = "";
    for (node=node.firstChild;node;node=node.nextSibling){
        alert(node.nodeValue + " = " + node.nodeType);
        if (node.nodeType == 3){
            all += node.nodeValue;
        }
    }
    alert(all);
}
</script>

That script is in the head of an HTML document. The body looks like this:

<body onload = "extractText(document.body)">
Stuff
<b>text</b>
<script>
var x = 1;
</script>
</body>

The problem is that alert(all); only shows "Stuff". The per-node alert(node.nodeValue + " = " + node.nodeType); also shows a number of null values that I don't really understand; it says null = 3 a few times. Could anyone tell me why this isn't working properly? Thanks in advance.

Upvotes: 1

Views: 717

Answers (2)

Brad Christie

Reputation: 101614

If you want all of the text in the document, you may want to look into a recursive call, because your loop only visits the direct children of the node you pass in. However, if you don't care about nested children, remove the first if (node.hasChildNodes()){} condition in the following:

function extractText(node){
    var txt = '';
    // Recurse into children. Uncomment the check below to skip <script> elements:
    // a <script> has children because the code it contains is stored as a
    // text node (nodeType === 3).
    if (node.hasChildNodes()/* && node.nodeName !== 'SCRIPT'*/){
        for (var c = 0; c < node.childNodes.length; c++){
            txt += extractText(node.childNodes[c]);
        }
    }else if(node.nodeType===3){
        txt += node.textContent;
    }
    return txt;
}
alert(extractText(document.body));

Also, you probably want to grab textContent over nodeValue, but that's your call. You can also get more granular and test whether the nodeName is SCRIPT and ignore it (if you so choose), but I'll let you make that determination.

Follow-Up: here's a fiddle you can play with, with the <script> test commented out and optional whitespace removal: http://jsfiddle.net/KZuk5/2/
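
For reference, here is a minimal sketch of the same recursive idea with the <script> check switched on and whitespace collapsed. It is not the code from the fiddle. The names collectText, normalizeSpace, SKIPPED, text and element are made up for this example, and the hand-built tree is only a stand-in for the question's <body> so the snippet can also run outside a browser.

```javascript
// Illustrative sketch, not the fiddle's code. All names here are hypothetical.
var SKIPPED = { SCRIPT: true, STYLE: true };

function collectText(node) {
    if (node.nodeType === 3) {                 // text node: return its text
        return node.nodeValue;
    }
    if (SKIPPED[String(node.nodeName).toUpperCase()]) {
        return '';                             // ignore script/style contents
    }
    var txt = '';
    var kids = node.childNodes || [];          // tolerate nodes without children
    for (var i = 0; i < kids.length; i++) {
        txt += collectText(kids[i]);           // recurse into every child
    }
    return txt;
}

// Collapse runs of whitespace so line breaks between tags do not pile up.
function normalizeSpace(s) {
    return s.replace(/\s+/g, ' ').trim();
}

// Stand-in for the question's <body>, built from plain objects.
function text(value) {
    return { nodeType: 3, nodeName: '#text', nodeValue: value, childNodes: [] };
}
function element(name, kids) {
    return { nodeType: 1, nodeName: name, nodeValue: null, childNodes: kids };
}

var body = element('BODY', [
    text('\nStuff\n'),
    element('B', [text('text')]),
    text('\n'),
    element('SCRIPT', [text('\nvar x = 1;\n')]),
    text('\n')
]);

console.log(normalizeSpace(collectText(body))); // prints "Stuff text"
```

In a page, the equivalent call would be normalizeSpace(collectText(document.body)). Checking the text node case first keeps the function from depending on hasChildNodes(), which the plain-object stand-ins above do not have.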

Upvotes: 3

Snuffleupagus

Reputation: 6755

There are different types of nodes; here we're looking at two: text nodes and element nodes (the HTML tags). A text node has a property called nodeValue that holds its text (and that you're accessing properly). Element nodes, however, do not carry a value there: their nodeValue is set to null.

To get the inner value of an element (HTML) node, use .innerHTML, which returns the markup inside it; .textContent returns only the text.
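
To see the difference, a small sketch like the following lists nodeName, nodeType and nodeValue for each direct child. The helper names describeChildren, text and element are invented for this example, and the tree is a plain-object stand-in for the question's <body>, so treat it as an illustration rather than real DOM output.

```javascript
// Illustrative sketch. describeChildren, text and element are hypothetical helpers.
function describeChildren(node) {
    var lines = [];
    for (var i = 0; i < node.childNodes.length; i++) {
        var child = node.childNodes[i];
        // JSON.stringify makes whitespace-only text and null easy to tell apart.
        lines.push(child.nodeName + ' type=' + child.nodeType +
                   ' value=' + JSON.stringify(child.nodeValue));
    }
    return lines;
}

// Plain-object stand-ins for DOM nodes, mirroring the question's <body>.
function text(value) {
    return { nodeType: 3, nodeName: '#text', nodeValue: value, childNodes: [] };
}
function element(name, kids) {
    return { nodeType: 1, nodeName: name, nodeValue: null, childNodes: kids };
}

var body = element('BODY', [
    text('\nStuff\n'),
    element('B', [text('text')]),
    text('\n'),
    element('SCRIPT', [text('\nvar x = 1;\n')]),
    text('\n')
]);

var report = describeChildren(body);
console.log(report.join('\n'));
```

In this stand-in the null values belong to the B and SCRIPT entries (nodeType 1), while the text nodes (nodeType 3) hold either "Stuff" or just line breaks. In a browser, describeChildren(document.body) should show the same pattern.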

Upvotes: 2
