Reputation: 891
I have this code for counting the number of words in the output of an HTML editor
(assuming htmlData has already been set):
var rawWords = htmlData.replace(/<(?:.|\s)*?>/g, '')
.replace(/(\r\n|\n|\r)/gm,' ');
var filteredWords = rawWords.replace(/\[([^\]]+)\]/g,'')
.replace(/\s+/g, " ")
.replace(/^\s+|\s+$/g, "");
From what I understand, the first statement strips the HTML tags and then replaces any line breaks with spaces.
The second statement removes anything in square brackets (so I can add notes without affecting the word count), then collapses extra whitespace and trims the ends.
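The two steps described above can be wrapped into one function so they are easier to test. This is only a sketch: the name countWords and the final split on spaces are assumptions, since the snippet above stops before the actual counting.

```javascript
// Hypothetical helper: countWords is not part of the original snippet.
function countWords(htmlData) {
  // Step 1: strip tags, then turn line breaks into spaces.
  var rawWords = htmlData.replace(/<(?:.|\s)*?>/g, '')
                         .replace(/(\r\n|\n|\r)/gm, ' ');
  // Step 2: drop [bracketed notes], collapse whitespace, trim the ends.
  var filteredWords = rawWords.replace(/\[([^\]]+)\]/g, '')
                              .replace(/\s+/g, ' ')
                              .replace(/^\s+|\s+$/g, '');
  // Assumed counting step: an empty string contains no words.
  return filteredWords === '' ? 0 : filteredWords.split(' ').length;
}

console.log(countWords('Apple\nCharlie\nTom')); // 3
console.log(countWords('one [a note] two'));    // 2
```

Note that tags are replaced with nothing rather than a space, so markup such as <p>Apple</p><p>Charlie</p> would run the two words together.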
But if I type this:
Apple
Charlie
Tom
It gives me a word count of 6, not 3. Any idea why? I'm not good at regex!
Thanks so much.
Upvotes: 0
Views: 3051
Reputation: 5404
Replacing spaces with "" doesn't work this way. Try:
.replace(/[ ]{2,}/g, " ") /* {2,} = two or more in a row */
.replace(/(^\s*)|(\s*$)/g, "");
instead of:
.replace(/\s+/g, " ")
.replace(/^\s+|\s+$/g, "");
and it should work fine.
Upvotes: 0
Reputation: 2571
Try this. It's simple: it strips everything that isn't a letter or whitespace (so numbers are dropped), splits on whitespace, and counts the resulting array.
window.onload = function() {
// get string as text
var text = document.body.innerText;
// replace all non letters (so we don't count 1 as a word)
text = text.replace(/[^a-zA-Z\s]/g, '');
// split on whitespace (trim first so leading/trailing whitespace adds no empty entries)
var words = text.trim().split(/\s+/);
// output -- 52
console.log('numwords', words, words.length); // numwords 52
}
Full example below:
<html>
<head>
<script type="text/javascript">// script</script>
</head>
<body>
a b c d e f g
1 1 1 1 1 1 1
the quick brown fox jumped over the lazy dog.
the quick brown fox jumped over the lazy dog.
the quick brown fox jumped over the lazy dog.<br><br><br><br><br>
the quick brown fox jumped over the lazy dog.
the quick brown fox jumped over the lazy dog.
</body>
</html>
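The same steps can be tried without a page by passing a plain string to a function. A minimal sketch, assuming a hypothetical helper named countLetterWords (not part of the answer above):

```javascript
// Hypothetical helper: the same steps as above, applied to a plain string.
function countLetterWords(text) {
  // keep only letters and whitespace (so "1" is not counted as a word)
  var lettersOnly = text.replace(/[^a-zA-Z\s]/g, '');
  // split on whitespace and drop empty entries left by leading/trailing whitespace
  var words = lettersOnly.split(/\s+/).filter(function (w) {
    return w.length > 0;
  });
  return words.length;
}

var sample = 'a b c d e f g\n' +
             '1 1 1 1 1 1 1\n' +
             'the quick brown fox jumped over the lazy dog.';
console.log(countLetterWords(sample)); // 16
```

The line of digits contributes nothing to the count, which is the point of the non-letter filter.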
Upvotes: 2
Reputation: 6203
These regexes are ugly and redundant. My advice would be to extract the plain text from the HTML by doing something like:
var a = document.createElement('div');
a.innerHTML = htmlData;
var textData = a.innerText;
then loop through this with a simple regex and increment a counter:
// match each run of word characters (letters, digits, underscore)
var patt = /\w+/g;
var counter=0;
var result=patt.exec(textData);
while(result!=null) {
counter++;
result=patt.exec(textData);
}
This is very crude (and makes plenty of assumptions that might not work for you), but A/ counter ends up holding the number of "words" (the definition of which you'll have to work on), and B/ you don't have to replace and remove huge amounts of text before getting what you said you wanted.
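The counting loop can also be written as a small function that takes the extracted text. A sketch under assumptions: the name countWordTokens is hypothetical, and the pattern /\w+/g is a simplified stand-in that treats every run of letters, digits, or underscores as one word.

```javascript
// Hypothetical helper: counts regex matches with an exec loop.
function countWordTokens(textData) {
  // the g flag makes exec resume after the previous match on each call
  var patt = /\w+/g;
  var counter = 0;
  while (patt.exec(textData) !== null) {
    counter++;
  }
  return counter;
}

console.log(countWordTokens('Apple\nCharlie\nTom')); // 3
console.log(countWordTokens("don't stop"));          // 3
```

With this pattern an apostrophe splits a word, so "don't" counts as two; that is one of the places where the definition of a "word" needs work.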
HTH
Upvotes: 1