Reputation: 950

Fetching word count in a web page

This must be a very common question, but I have not come across a concrete, stable solution for it.

I just want to get the number of words in a web page, and it has to work across all browsers. My current implementation is:

var body = top.document.body;
if(body) {
    var content = body.innerText || body.textContent;
    content = content.replace(/\n/ig,' ');
    content = content.replace(/\s+/gi,' ');
    content = content.replace(/(^\s|\s$)/gi,'');
    if(!body.innerText) {
        content = content.replace(/<script/gi,'');
    }
    console.log(content);
    console.log(content.split(' ').length);
}

This works well, but it fails in some Firefox versions because innerText is not available there.

If I use textContent instead, it also returns the contents of any <script> tags. E.g., if the page content is:

<body>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
    <script type="text/javascript"> 
    console.log('Hellow World');
    var some = "some";
    var two = "two";
    var three = "three";
    </script>

    <h1 style="text-align:center">Static content from Nginx</h1>
    <div>
        This is a 
            static.
            <div>
                This is a 
                    static.
            </div>
    </div>
</body>

then textContent will include the JS code as well, which gives me a wrong word count.

Is there a concrete solution that works in any environment?

PS: No jQuery.

Upvotes: 0

Answers (3)

Juvenik

Reputation: 950

Thank you all for such helpful answers. I found the approach below to use when innerText is not defined in a browser. The result is very similar to what innerText returns, so I think it will be consistent across browsers.

Please take a look and let me know whether this answer can be accepted, and whether you find any discrepancy in the method I am using.

function getWordCount() {
    try {
        var body = top.document.querySelector("body");
        if (body) {
            var content = body.innerText || getInnerText(top.document.body, top);
            content = (content || '').replace(/\n/g, ' ');
            // match() returns null when there are no words, so fall back to an empty list
            var words = content.match(/\S+/g) || [];
            return words.length;
        }
    } catch (e) {
        processError("getWordCount", e);
    }
}


function getInnerText(el, win) {
    try {
        win = win || window;
        var doc = win.document,
            sel, range, prevRange, selString;
        if (win.getSelection && doc.createRange) {
            sel = win.getSelection();
            if (sel.rangeCount) {
                prevRange = sel.getRangeAt(0);
            }
            range = doc.createRange();
            range.selectNodeContents(el);
            sel.removeAllRanges();
            sel.addRange(range);
            selString = sel.toString();
            sel.removeAllRanges();
            prevRange && sel.addRange(prevRange);
        } else if (doc.body.createTextRange) {
            // Legacy IE: a TextRange exposes its contents through .text
            range = doc.body.createTextRange();
            range.moveToElementText(el);
            selString = range.text;
        }
        return selString;
    } catch (e) {
        processError('getInnerText', e);
    }
}

The result I get is the same as that of innerText, and it is more accurate than using a regex or removing tags.

Please give me your views on this.

Upvotes: 0

Yotam Salmon

Reputation: 2411

OK, you have a few separate problems here:

Cross-browser `innerText`

I'd go with:

var text = document.body[('innerText' in document.body) ? 'innerText' : 'textContent'];

This prefers innerText over textContent whenever the browser supports it.
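The `in` check is not quite the same as the `body.innerText || body.textContent` fallback from the question: with `||`, an innerText that exists but is an empty string still falls through to textContent. A minimal sketch of the difference, where `readText` is a hypothetical helper name and plain objects stand in for DOM elements so it can run outside a browser:

```javascript
// Hypothetical helper: prefer innerText whenever the property exists,
// even if its value is an empty string.
function readText(el) {
    return ('innerText' in el) ? el.innerText : el.textContent;
}

// Plain objects standing in for DOM elements (an assumption for this demo).
var emptyVisible = { innerText: '', textContent: 'var hidden = 1;' };
var legacy = { textContent: 'only textContent here' };

// The 'in' check keeps the empty innerText...
console.log(JSON.stringify(readText(emptyVisible)));
// ...while the || fallback would pick up textContent instead.
console.log(JSON.stringify(emptyVisible.innerText || emptyVisible.textContent));
// Without innerText, both approaches read textContent.
console.log(JSON.stringify(readText(legacy)));
```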

Stripping result of <script> tags.

dandavis offers a neat solution to that:

function noscript(strCode){
    var html = $(strCode.bold()); 
    html.find('script').remove();
    return html.html();
}

And a non-jQuery solution:

function noscript(strCode){
    // [\s\S] also matches newlines, so multi-line script blocks are removed
    return strCode.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}

Both versions take a string of HTML, strip its <script> tags, and return what is left; the jQuery one does so by first turning the string into a "fake" HTML fragment.

Of course, you can extend the function to also remove <style> tags and others.
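As a rough sketch of that extension, assuming the input is a string of HTML markup (for example from innerHTML) rather than already-extracted text: `stripMarkup` below is a hypothetical name, and regex-based HTML handling like this is fragile (it does not decode entities such as `&amp;`, and a script that contains the literal text `</script>` inside a string would confuse it).

```javascript
// Hypothetical helper: removes <script> and <style> blocks together with
// their contents, then replaces every remaining tag with a space.
function stripMarkup(html) {
    return html
        // \1 refers back to whichever tag name opened the block;
        // [\s\S] matches newlines too, so multi-line blocks are removed.
        .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
        // A space (not '') keeps words in adjacent elements apart.
        .replace(/<[^>]+>/g, ' ');
}

// Shortened version of the page from the question.
var sample = [
    '<script type="text/javascript">',
    "console.log('Hellow World');",
    'var some = "some";',
    '</script>',
    '<h1 style="text-align:center">Static content from Nginx</h1>',
    '<div>This is a static.</div>'
].join('\n');

var text = stripMarkup(sample);
console.log(text.trim().split(/\s+/));
```

For this sample only the heading and the div text survive, so the script body no longer inflates the count.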

Counting words

Your method does the job, but I think a simple regex would do it better. You can count the words in a string using:

str.match(/\S+/g).length;
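One thing to watch for: `match` with the `g` flag returns `null` when nothing matches, so the one-liner throws on an empty or whitespace-only string. A small guarded sketch (`countWords` is a hypothetical name):

```javascript
// Counts runs of non-whitespace characters; returns 0 instead of throwing
// when the string is empty or contains only whitespace.
function countWords(str) {
    var matches = String(str).match(/\S+/g);
    return matches ? matches.length : 0;
}

console.log(countWords('This is a \n      static.'));
console.log(countWords('   \n  '));
```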

Finally

The final result should look like:

var body = top.document.body;
if(body) {
    var content = body[('innerText' in body) ? 'innerText' : 'textContent'];
    content = noscript(content);
    // match() returns null when there are no words
    alert((content.match(/\S+/g) || []).length);
}
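A caveat worth keeping in mind: innerText and textContent return plain text without any markup, so by the time `noscript` sees the string there are no `<script>` tags left for a regex to find, and with textContent the script source is already mixed into the text. One alternative, sketched here under assumptions rather than offered as a definitive fix, is to walk the node tree and skip unwanted elements while collecting text. `collectText` and the `SKIPPED` list are hypothetical names and choices, and the small `txt`/`el` helpers only imitate DOM nodes so the example can run outside a browser.

```javascript
// Elements whose text should not count as page words (an assumed list).
var SKIPPED = { SCRIPT: true, STYLE: true, NOSCRIPT: true };

// Recursively gathers the text of a node, ignoring skipped elements.
function collectText(node) {
    if (node.nodeType === 3) {                 // 3 = text node
        return node.nodeValue;
    }
    if (node.nodeType !== 1 || SKIPPED[node.nodeName.toUpperCase()]) {
        return '';                             // comments, skipped elements, etc.
    }
    var parts = [];
    for (var i = 0; i < node.childNodes.length; i++) {
        parts.push(collectText(node.childNodes[i]));
    }
    // Joining with a space keeps words in neighbouring elements apart.
    return parts.join(' ');
}

// Stand-ins for DOM nodes, only for running this outside a browser.
function txt(value) { return { nodeType: 3, nodeValue: value }; }
function el(name, children) {
    return { nodeType: 1, nodeName: name, childNodes: children || [] };
}

var fakeBody = el('BODY', [
    el('SCRIPT', [txt('var some = "some";')]),
    el('H1', [txt('Static content from Nginx')]),
    el('DIV', [txt('This is a static.')])
]);

var words = collectText(fakeBody).match(/\S+/g) || [];
console.log(words.length);
```

In a browser the call would be `collectText(document.body)`. Joining with a space is a trade-off: it stops "Nginx" and "This" from fusing across elements, but it would also split a word that is interrupted by an inline tag such as `<b>`.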

Upvotes: 1

Chauskin Rodion

Reputation: 1249

What about hidden, invisible, or overlaid blocks? Do you want to count the words inside all of them? And what about images (the alt attribute of an image)?

If you want to count everything, just strip the tags and count the text of all the remaining blocks, something like $('body :not(script)').text().

Upvotes: 0
