JavaScript: split a Unicode sentence into words

Question

I'm using the following regexp to split a sentence into an array of words.

/\b(?![\s.,:;'"])/

It works perfectly for ASCII-only sentences, but fails on the following one.

læseWEB læser teksten på dit website op.

I'm expecting

['læseWEB ', 'læser ', 'teksten ', 'på ', 'dit ', 'website ', 'op.'].

But I'm getting

['l', 'æ', 'se', 'WEB', 'l', 'æ', 'ser', 'teksten', 'p', 'å', 'dit','website', 'op']

I know JavaScript has issues with Unicode handling.

I was going to use the XRegExp JavaScript library, but I can't find the exact solution I'm looking for.

nhahtdh · Accepted Answer

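The definition of \b in JavaScript is based on the definition of \w, which is [A-Za-z0-9_] (it only covers ASCII characters). Any non-ASCII letter such as æ or å therefore counts as a non-word character, and \b reports a boundary on either side of it.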
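A quick way to see this for yourself (a minimal sketch; the helper name isWordChar is only for illustration and assumes the text uses the precomposed æ, U+00E6):

```javascript
// \w only covers ASCII letters, digits, and underscore.
const isWordChar = (ch) => /\w/.test(ch);
console.log(isWordChar("l")); // true
console.log(isWordChar("æ")); // false

// Because "æ" is not a \w character, \b matches on both sides of it,
// and split() cuts the word at those positions.
const parts = "læser".split(/\b/);
console.log(parts); // [ 'l', 'æ', 'ser' ]
```

This is the same fragmentation shown in the question, just reduced to a single word.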

If you use XRegExp with the Unicode Categories and Unicode Properties add-ons, you can match the words (instead of splitting the string) with the following code:

XRegExp.matchChain("læseWEB læser teksten på dit website op.", [XRegExp("[\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}]+", "g")])
>>> [ "læseWEB", "læser", "teksten", "på", "dit", "website", "op" ]

[\p{Alphabetic}\p{Nd}\p{Pc}\p{M}] is an incomplete emulation of the word character suggested in Annex C of UTS #18 (Unicode Regular Expressions). However, it should work for most purposes: it works even if the text uses combining marks instead of a single precomposed character to represent a letter.

If you don't want to load an extra library, you can take a look at the XRegExp source and pull out the lists of code points to build your own RegExp.
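As an alternative to copying code point lists by hand, here is a rough sketch that uses native Unicode property escapes instead of XRegExp. This is a different technique from the one above, and it assumes your engine supports the "u" flag with \p{...} (ES2018 or later). The helper names matchWords and splitKeepingSeparators are made up for this example.

```javascript
// Assumption: the engine supports ES2018 Unicode property escapes ("u" flag).
// WORD mirrors the character class used in the XRegExp example above.
const WORD = "[\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}]";
const NON_WORD = "[^\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}]";

// Hypothetical helper: return only the words, without separators.
function matchWords(text) {
  return text.match(new RegExp(`${WORD}+`, "gu")) || [];
}

// Hypothetical helper: keep trailing spaces and punctuation attached to
// each word. Any non-word text before the first word is dropped.
function splitKeepingSeparators(text) {
  return text.match(new RegExp(`${WORD}+${NON_WORD}*`, "gu")) || [];
}

const sentence = "læseWEB læser teksten på dit website op.";
console.log(matchWords(sentence));
console.log(splitKeepingSeparators(sentence));

// "a" followed by U+030A (combining ring above) still counts as one word.
console.log(matchWords("pa\u030A").length);
```

The second helper is closer to the output the question asks for, since each word keeps the whitespace or punctuation that follows it.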
