Reputation: 950
This is probably a very common question, but I have not come across a concrete, stable solution for it.
I just want to get the number of words in a web page, in a way that works across all browsers. My current implementation is:
var body = top.document.body;
if(body) {
var content = body.innerText || body.textContent;
content = content.replace(/\n/ig,' ');
content = content.replace(/\s+/gi,' ');
content = content.replace(/(^\s|\s$)/gi,'');
if(!body.innerText) {
content = content.replace(/<script/gi,'');
}
console.log(content);
console.log(content.split(' ').length);
}
This works well in most browsers, but not in some Firefox versions, where innerText is not supported.
If I use textContent instead, the result also includes the contents of script tags, if any are present. For example, if the web page content is:
<body>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script type="text/javascript">
console.log('Hellow World');
var some = "some";
var two = "two";
var three = "three";
</script>
<h1 style="text-align:center">Static content from Nginx</h1>
<div>
This is a
static.
<div>
This is a
static.
</div>
</div>
</body>
Then textContent will include the JS code as well, which gives me the wrong word count.
What is a concrete solution that works in any environment?
PS: No jQuery.
Upvotes: 0
Views: 2491
Reputation: 950
Thank you all for such helpful answers. I found the approach below to use when innerText is not defined in a browser. The result is very similar to that of innerText, so I think it will be consistent across all browsers.
Please take a look and let me know whether this answer can be accepted, and whether you find any discrepancy in the method I am using.
function getWordCount() {
try {
var body = top.document.querySelector("body");
if (body) {
var content = body.innerText || getInnerText(top.document.body, top);
content = content.replace(/\n/ig, ' ');
// match() returns null when there are no words, so fall back to an empty array.
var wordCount = (content.match(/\S+/g) || []).length;
return wordCount;
}
} catch (e) {
processError("getWordCount", e);
}
}
function getInnerText(el, win) {
try {
win = win || window;
var doc = win.document,
sel, range, prevRange, selString;
if (win.getSelection && doc.createRange) {
sel = win.getSelection();
if (sel.rangeCount) {
prevRange = sel.getRangeAt(0);
}
range = doc.createRange();
range.selectNodeContents(el);
sel.removeAllRanges();
sel.addRange(range);
selString = sel.toString();
sel.removeAllRanges();
prevRange && sel.addRange(prevRange);
} else if (doc.body.createTextRange) {
range = doc.body.createTextRange();
range.moveToElementText(el);
// Read the text directly; only selecting the range would leave selString undefined.
selString = range.text;
}
return selString;
} catch (e) {
processError('getInnerText', e);
}
}
The result I am getting is the same as that of innerText, and it is more accurate than using a regex or removing tags.
Please give me your views on this.
Upvotes: 0
Reputation: 2411
OK, you have two problems there:
innerText
I'd go with:
var text = document.body[('innerText' in document.body) ? 'innerText' : 'textContent'];
That prefers innerText over textContent whenever it is available.
For the second problem, the script content, dandavis offers a neat solution:
function noscript(strCode){
var html = $(strCode.bold());
html.find('script').remove();
return html.html();
}
And a non-jQuery solution:
function noscript(strCode){
// [\s\S] also matches newlines, so multi-line script blocks are removed too.
return strCode.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}
Both functions take a string of HTML markup. The jQuery version turns it into a "fake" HTML document, strips its script tags, and returns the remaining markup.
Of course, you could extend the function to remove <style> tags and others as well.
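As a rough sketch of that improvement, assuming the input is a string of HTML markup (for example body.innerHTML) rather than already-extracted text, the regex version could be extended as follows. The name stripNonContent is hypothetical, and regex-based HTML handling is only an approximation that can miss unusual markup.

```javascript
// Hypothetical helper: removes <script> and <style> elements, then all
// remaining tags, from a string of HTML markup. Regex-based, so approximate.
function stripNonContent(html) {
  return html
    // [\s\S] matches newlines too, so multi-line blocks are removed.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Replace every remaining tag with a space so adjacent words stay apart.
    .replace(/<[^>]+>/g, ' ');
}

var sample = '<script>\nvar some = "some";\n</script>' +
  '<h1>Static content</h1><div>This is a static.</div>';
var words = stripNonContent(sample).match(/\S+/g) || [];
console.log(words.length); // 6
```

Replacing tags with a space instead of an empty string keeps words in neighbouring blocks from merging, at the cost of splitting a word that has inline markup in the middle of it.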
Your method of counting words is all right, but I think a simple regex does the job much better. You can count the words in a string using:
str.match(/\S+/g).length;
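One caveat with that one-liner: match() returns null rather than an empty array when the string contains no words, so reading .length would throw. A minimal sketch of a guarded version, with countWords as a hypothetical name:

```javascript
// Hypothetical helper: counts runs of non-whitespace characters.
function countWords(text) {
  // match() returns null (not an empty array) when nothing matches.
  var matches = (text || '').match(/\S+/g);
  return matches ? matches.length : 0;
}

console.log(countWords('  This is a\n static.  ')); // 4
console.log(countWords('   '));                      // 0
```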
The final result should look like:
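var body = top.document.body;
if(body) {
// Prefer innerText; fall back to textContent where innerText is unsupported.
var content = body[('innerText' in body) ? 'innerText' : 'textContent'];
// Note: textContent holds script source as plain text without <script> tags,
// so noscript() only removes scripts that appear as literal markup in the string.
content = noscript(content);
// match() returns null when there are no words, so fall back to an empty array.
alert((content.match(/\S+/g) || []).length);
}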
Upvotes: 1
Reputation: 1249
What about hidden, invisible, or overlaid blocks? Do you want to count the words inside all of them? What about images (the alt attribute of an image)?
If you want to count everything, just strip the tags and count the text of all the remaining blocks, with something like $('body :not(script)').text().
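Since the question rules out jQuery, here is a rough sketch of the same idea using only standard DOM node properties (nodeType, nodeName, nodeValue, childNodes). The function name collectText and the set of skipped tags are assumptions for illustration. It does not deal with hidden elements or alt text, and because it joins nodes with spaces, a word interrupted by inline markup is counted as several words.

```javascript
// Hypothetical helper: concatenates the text of every text node under `node`,
// skipping <script>, <style> and <noscript> elements entirely.
function collectText(node) {
  var SKIP = { SCRIPT: true, STYLE: true, NOSCRIPT: true };
  if (node.nodeType === 3) {            // 3 = text node
    return node.nodeValue;
  }
  if (node.nodeType !== 1 || SKIP[node.nodeName.toUpperCase()]) {
    return '';                          // comments, skipped elements, etc.
  }
  var parts = [];
  for (var i = 0; i < node.childNodes.length; i++) {
    parts.push(collectText(node.childNodes[i]));
  }
  // Join with spaces so words in adjacent blocks do not run together.
  return parts.join(' ');
}

// Usage in a browser:
// var words = collectText(document.body).match(/\S+/g) || [];
// console.log(words.length);
```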
Upvotes: 0