木川 炎星

Reputation: 4093

Regular Expression for accurate word-count using JavaScript

I'm trying to put together a regular expression for a JavaScript command that accurately counts the number of words in a textarea.

One solution I had found is as follows:

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\b\w+\b/).length -1;

But this doesn't count any non-Latin characters (e.g. Cyrillic, Hangul); it skips over them completely.

Another one I put together:

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\s+/g).length -1;

But this doesn't count accurately unless the document ends in a space character. If a space character is appended to the value being counted it counts 1 word even with an empty document. Furthermore, if the document begins with a space character an extraneous word is counted.
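The behaviour described above can be reproduced with a few made-up sample strings. This is only a quick check; the helper names `byWordChars` and `byWhitespace` are illustrative and not part of any library.

```javascript
// Approach 1 from the question: split on runs of ASCII word characters.
const byWordChars = (s) => s.split(/\b\w+\b/).length - 1;
// Approach 2 from the question: split on runs of whitespace.
const byWhitespace = (s) => s.split(/\s+/g).length - 1;

console.log(byWordChars("one two"));     // 2
console.log(byWordChars("Привет мир"));  // 0 (\w and \b only know ASCII word characters)

console.log(byWhitespace("one two"));    // 1 (one short without a trailing space)
console.log(byWhitespace("one two "));   // 2
console.log(byWhitespace(" "));          // 1 (a lone space is counted as a word)
console.log(byWhitespace(" one two "));  // 3 (leading space adds an extra word)
```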

Is there a regular expression I can put into this command that counts the words accurately, regardless of input method?

Upvotes: 19

Views: 43791

Answers (8)

Вадим Булах

Reputation: 31

// match() returns null when nothing matches, so fall back to an empty array
const wordsCount = (str.match(/\p{L}+/gu) || []).length
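A minimal sketch of how this pattern behaves, assuming an engine that supports Unicode property escapes (the `u` flag with `\p{L}`). The function name `countLetterRuns` and the sample strings are made up for illustration.

```javascript
// Hypothetical helper: counts runs of Unicode letters in a string.
function countLetterRuns(str) {
  const matches = str.match(/\p{L}+/gu); // null when the string has no letters
  return matches ? matches.length : 0;
}

console.log(countLetterRuns("Hello мир 세계")); // 3
console.log(countLetterRuns(""));               // 0
console.log(countLetterRuns("don't stop"));     // 3 (the apostrophe splits "don't")
console.log(countLetterRuns("42"));             // 0 (digits are not letters)
```

Note that `\p{L}` covers letters only, so numbers are not counted and apostrophes or hyphens break a word in two. Whether that is acceptable depends on what should count as a word.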

Upvotes: 0

geekdenz

Reputation: 809

For me this gave the best results:

value.split(/\b\W+\b/).length

with

var words = value.split(/\b\W+\b/)

you get all words.

Explanation:

  • \b is a word boundary
  • \W is a NON-word character; a capital letter usually means the negation of the lowercase class
  • '+' means 1 or more of the preceding character or character class
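To make the three pieces above concrete, here is a small sketch of what this split returns. The helper name `splitWords` and the sample strings are assumptions for illustration only.

```javascript
// Split on runs of non-word characters that sit between two word boundaries.
const splitWords = (value) => value.split(/\b\W+\b/);

console.log(splitWords("Hello, world! How are you?"));
// [ 'Hello', 'world', 'How', 'are', 'you?' ]

// \b and \W are defined in terms of ASCII word characters,
// so text without any of them is never split:
console.log(splitWords("Привет мир").length); // 1

// An empty string still yields one (empty) element:
console.log(splitWords("").length); // 1
```

Trailing punctuation stays attached to the last word because there is no word boundary after it. The count is still right in that case, but the empty-string and non-Latin cases need separate handling.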

I recommend learning regular expressions. It's a great skill to have because they are so powerful. ;-)

Upvotes: 4

mpjan

Reputation: 1850

Try

    value.match(/\w+/g).length;

This will match a string of characters that can be in a word. Whereas something like:

    value.match(/\S+/g).length;

will result in an incorrect count if the user adds commas or other punctuation that is not followed by a space - or adds a comma with a space either side of it.
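A short comparison of the two patterns on made-up inputs may help here. The helper names are hypothetical, and the `|| []` guard is an addition so that an empty string gives 0 instead of throwing on `null`.

```javascript
// Count runs of ASCII word characters.
const countWordChars = (value) => (value.match(/\w+/g) || []).length;
// Count runs of non-whitespace characters.
const countNonSpace = (value) => (value.match(/\S+/g) || []).length;

// Comma not followed by a space: \S+ merges two words into one.
console.log(countWordChars("red,green blue")); // 3
console.log(countNonSpace("red,green blue"));  // 2

// Comma with a space on either side: \S+ counts the comma as a word.
console.log(countWordChars("red , green"));    // 2
console.log(countNonSpace("red , green"));     // 3

// The trade-off: \w+ splits on apostrophes.
console.log(countWordChars("don't"));          // 2
console.log(countNonSpace("don't"));           // 1
```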

Upvotes: 4

Sharikul Islam

Reputation: 319

My simple JavaScript library, FuncJS, has a function called "count()" that does exactly what its name says: it counts words.

For example, say that you have a string full of words, you can simply place it in between the function brackets, like this:

count("How many words are in this string?");

and then call the function, which will then return the number of words. Also, this function is designed to ignore any amount of whitespace, thus giving an accurate result.

To learn more about this function, please read the documentation at http://docs.funcjs.webege.com/count().html and the download link for FuncJS is also on the page.

Hope this helps anyone wanting to do this! :)

Upvotes: 0

morja

Reputation: 8550

Try to count anything that is not whitespace and with a word boundary:

value.split(/\b\S+\b/g).length

You could also try to use unicode ranges, but I am not sure if the following one is complete:

value.split(/[\u0080-\uFFFF\w]+/g).length
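A rough sketch of how the Unicode range behaves, using match() rather than split() so that the result is the number of runs itself. The names `unicodeRun` and `countRuns` and the sample strings are assumptions for illustration.

```javascript
// Runs of ASCII word characters or any code unit from U+0080 to U+FFFF.
const unicodeRun = /[\u0080-\uFFFF\w]+/g;
const countRuns = (value) => (value.match(unicodeRun) || []).length;

console.log(countRuns("Привет мир"));      // 2
console.log(countRuns("안녕하세요 세계"));   // 2

// split() returns the pieces around the matches, so its length is
// one more than the number of matches:
console.log("a b".split(/\b\S+\b/g).length); // 3
console.log("a b".match(/\b\S+\b/g).length); // 2

// The range is broad: it also contains non-letters such as the
// no-break space (U+00A0), which then glues two words together.
console.log(countRuns("one\u00A0two"));    // 1
```

So the range does pick up Cyrillic and Hangul, but it is over-inclusive rather than incomplete: punctuation and spacing characters above U+007F are treated as part of a word.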

Upvotes: 7

albertov

Reputation: 2334

The correct regexp would be /\s+/ in order to discard non-words:

'Lorem ipsum dolor , sit amet'.split(/\S+/g).length
7
'Lorem ipsum dolor , sit amet'.split(/\s+/g).length
6

Upvotes: 1

David Tang

Reputation: 93664

This should do what you're after:

value.match(/\S+/g).length;

Rather than splitting the string, you're matching on any sequence of non-whitespace characters.

There's the added bonus of being easily able to extract each word if needed ;)
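As a sketch of the extraction mentioned above: match() with the g flag returns the matched words as an array, or null when there are none. The `getWords` helper and the `|| []` fallback are additions for illustration, so that an empty or whitespace-only value gives 0 instead of throwing.

```javascript
// Hypothetical helper: returns every run of non-whitespace characters.
function getWords(value) {
  return value.match(/\S+/g) || []; // match() yields null when nothing matches
}

const words = getWords("  Lorem ipsum   dolor sit amet ");
console.log(words);                  // [ 'Lorem', 'ipsum', 'dolor', 'sit', 'amet' ]
console.log(words.length);           // 5
console.log(getWords("").length);    // 0
console.log(getWords("   ").length); // 0
```

Because this matches words instead of splitting on separators, leading, trailing, and repeated whitespace do not affect the count.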

Upvotes: 42

Valerij

Reputation: 27738

You could extend or change your methods like this:

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.split(/\b\(.*?)\b/).length -1;

if you want to match things like email addresses as well,

and

document.querySelector("#wordcount").innerHTML = document.querySelector("#editor").value.trim().split(/\s+/g).filter(Boolean).length;

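Also try splitting on \s rather than matching \w: in JavaScript, \w only covers ASCII letters, digits, and the underscore, whereas \s also matches Unicode whitespace.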

Source: http://www.regular-expressions.info/charclass.html

Upvotes: 1
