DeltaTango

Reputation: 891

javascript regex for counting words

I got this code for counting the number of words from an html editor.

(providing htmlData has already been set)
var rawWords = htmlData.replace(/<(?:.|\s)*?>/g, '')
                       .replace(/(\r\n|\n|\r)/gm,' ');
var filteredWords = rawWords.replace(/\[([^\]]+)\]/g,'')
                            .replace(/\s+/g, " ")
                            .replace(/^\s+|\s+$/g, "");

From what I understand, the first statement strips the HTML tags and then replaces any line breaks with a space.

The next statement removes anything in square brackets (this is to add notes without affecting the word count), collapses runs of whitespace into a single space, and trims the ends.
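
For reference, the two statements above can be wrapped into one function and followed by a count. This is only a sketch: the names `cleanText` and `countWords` are hypothetical, and the final split on a single space is an assumption, since the post does not show how the count itself is taken.

```javascript
// Hypothetical helper: applies the same replacements as the snippet above.
function cleanText(htmlData) {
  return htmlData
    .replace(/<(?:.|\s)*?>/g, '')      // strip anything that looks like a tag
    .replace(/(\r\n|\n|\r)/gm, ' ')    // turn each line break into a space
    .replace(/\[([^\]]+)\]/g, '')      // drop [bracketed notes]
    .replace(/\s+/g, ' ')              // collapse whitespace runs to one space
    .replace(/^\s+|\s+$/g, '');        // trim both ends
}

// Assumption: a word is whatever sits between single spaces in the cleaned text.
function countWords(htmlData) {
  var text = cleanText(htmlData);
  // ''.split(' ') returns [''], so an empty result needs its own case
  return text === '' ? 0 : text.split(' ').length;
}

var sample = '<p>Apple</p>\r\n<p>[a note] Charlie</p>\n<p>Tom</p>';
console.log(cleanText(sample));   // "Apple Charlie Tom"
console.log(countWords(sample));  // 3
```

Logging the intermediate string from `cleanText` for the real editor output is a quick way to see which extra tokens end up being counted.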

But if I type this:

Apple


Charlie

Tom

It gives me a word count of 6, not 3. Any idea why? I'm not good at regex!!!!

thanks so much

Upvotes: 0

Views: 3051

Answers (3)

Kareem

Reputation: 5404

Replacing space with "" doesn't work this way. Try:

 .replace(/[ ]{2,}/gi," ")  /* {2,} = two or more in a row */
 .replace(/(^\s*)|(\s*$)/gi,"");

instead of:

.replace(/\s+/g, " ")
.replace(/^\s+|\s+$/g, "");

and it should work fine.

Upvotes: 0

ansiart

Reputation: 2571

Try this. It's simple: it strips everything that isn't a letter or whitespace (so numbers aren't counted), splits on whitespace, and counts the resulting array.

window.onload = function() {

    // get the page content as plain text
    var text = document.body.innerText;

    // remove everything that is not a letter or whitespace (so we don't count 1 as a word)
    text     = text.replace(/[^a-zA-Z\s]/g, '');

    // split on whitespace, dropping empty strings from leading/trailing whitespace
    var words = text.split(/\s+/).filter(Boolean);

    // output for the full example below: 52 words
    console.log('numwords', words, words.length);
};

full example below:

<html>
<head>
<script type="text/javascript">// script</script>
</head>
<body>

a b c d e f g
1 1 1 1 1 1 1




the quick brown fox jumped over the lazy dog.
the quick brown fox jumped over the lazy dog.
the quick brown fox jumped over the lazy dog.<br><br><br><br><br>
the quick brown fox jumped over the lazy dog.
the quick brown fox jumped over the lazy dog.

</body>
</html>
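
The same three steps also work on a plain string, which makes them easier to try outside a browser. A minimal sketch, assuming the text has already been extracted; the function name `countLetterWords` is hypothetical, and the extra filter guards against the empty strings that `split` produces when the text starts or ends with whitespace.

```javascript
// Hypothetical helper: counts runs of letters separated by whitespace.
function countLetterWords(text) {
  return text
    .replace(/[^a-zA-Z\s]/g, '')                     // drop digits and punctuation
    .split(/\s+/)                                    // split on runs of whitespace
    .filter(function (w) { return w.length > 0; })   // discard empty edge entries
    .length;
}

var sampleText = 'a b c d e f g\n1 1 1 1 1 1 1\n\nthe quick brown fox jumped over the lazy dog.';
console.log(countLetterWords(sampleText)); // 16 (7 letters + 9 words, digits ignored)
```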

Upvotes: 2

dda

Reputation: 6203

These regexes are ugly and redundant. My advice would be to get the plain text out of the HTML by doing something like:

var a = document.createElement('div');
a.innerHTML = htmlData;
var textData = a.innerText;

then loop through this with a simple regex and increment a counter:

var patt = /\w+/g;  // a "word" here is a run of letters, digits or underscores
var counter = 0;
// with the g flag, exec() resumes from patt.lastIndex, so each call returns the next word
while (patt.exec(textData) !== null) {
  counter++;
}

This is very crude (and makes plenty of assumptions that might not work for you) BUT, A/ you'll get in counter the number of "words" [the definition of which you'll have to work on], and B/ you don't have to replace and remove huge amounts of text before getting what you stated you wanted.
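
How much the definition of a "word" matters can be seen by running different patterns over the same string. A small sketch; `countMatches` and the sample patterns are illustrative assumptions, not part of the answer above.

```javascript
// Hypothetical helper: counts non-overlapping matches of a global regex.
function countMatches(text, pattern) {
  var matches = text.match(pattern);  // null when nothing matches
  return matches ? matches.length : 0;
}

var phrase = "Don't count 42 twice";
console.log(countMatches(phrase, /\w+/g));         // 5: Don, t, count, 42, twice
console.log(countMatches(phrase, /[A-Za-z']+/g));  // 3: Don't, count, twice
console.log(countMatches(phrase, /\S+/g));         // 4: Don't, count, 42, twice
```

Which of these is right depends on whether numbers and contractions should count as words in the editor.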

HTH

Upvotes: 1
