adeltahir

Reputation: 1097

JavaScript: splitting a Unicode sentence into words

I'm using the following regexp to split a sentence into an array of words.

/\b(?![\s.,:;'"])/

It works perfectly for sentences that contain only ASCII letters, but fails on the following sentence:

læseWEB læser teksten på dit website op.

I'm expecting

['læseWEB ', 'læser ', 'teksten ', 'på ', 'dit ', 'website ', 'op.'].

But I'm getting

['l', 'æ', 'se', 'WEB', 'l', 'æ', 'ser', 'teksten', 'p', 'å', 'dit', 'website', 'op']

I know JavaScript has issues with Unicode handling.

I was going to use the XRegExp JavaScript library, but I can't find the exact solution I'm looking for.

Upvotes: 0

Views: 406

Answers (2)

nhahtdh

Reputation: 56829

The definition of \b in JavaScript is based on the definition of \w, which is [A-Za-z0-9_] (it only covers ASCII characters).

If you use XRegExp with the Unicode Categories and Unicode Properties add-ons, you can match the words (instead of splitting the string) with the following code:

XRegExp.matchChain("læseWEB læser teksten på dit website op.", [XRegExp("[\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}]+", "g")])
>>> [ "læseWEB", "læser", "teksten", "på", "dit", "website", "op" ]

[\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}] is an incomplete emulation of a word character as suggested in Annex C of UTS #18 (Unicode Regular Expressions). However, it should work for most purposes; it works even if the text uses combining marks instead of a single precomposed character to represent a letter.

If you don't want to load an extra library, you can take a look at the XRegExp source and pull out the list of code points to build your own RegExp.
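As a possible alternative to copying code point lists, here is a minimal sketch that assumes a JavaScript engine with ES2018 Unicode property escapes (the `u` flag), which is not something the answer above relies on. The names `WORD_CLASS`, `wordsOnly` and `wordsWithTrailing` are made up for illustration. The second pattern keeps the trailing whitespace and punctuation attached to each word, which appears to be what the question expects.

```javascript
// Assumption: the engine supports \p{...} property escapes with the `u` flag (ES2018+).
// Same character set as the XRegExp pattern above, without the library.
const WORD_CLASS = "\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}";
const sentence = "læseWEB læser teksten på dit website op.";

// Words only: runs of "word" characters.
const wordsOnly = sentence.match(new RegExp(`[${WORD_CLASS}]+`, "gu")) ?? [];

// Words plus whatever non-word characters follow them (spaces, punctuation).
const wordsWithTrailing =
  sentence.match(new RegExp(`[${WORD_CLASS}]+[^${WORD_CLASS}]*`, "gu")) ?? [];

console.log(wordsOnly);
// [ 'læseWEB', 'læser', 'teksten', 'på', 'dit', 'website', 'op' ]
console.log(wordsWithTrailing);
// [ 'læseWEB ', 'læser ', 'teksten ', 'på ', 'dit ', 'website ', 'op.' ]
```

Matching rather than splitting avoids \b entirely, so the ASCII-only definition of \w no longer matters.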

Upvotes: 1

The Guy with The Hat

Reputation: 11132

\b is a word boundary; it matches a position in a string that has a "word character" (a character matching [0-9_a-zA-Z]) on one side and a non-word character ([^0-9_a-zA-Z]) or the start or end of the string on the other side. æ, å and similar characters are technically not word characters as far as the regex engine is concerned, so a boundary is found on either side of them whenever an ASCII letter is adjacent, which is why your words are split there.
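A small sketch of that behaviour, in case it helps to see it run. The helper name `isAsciiWordChar` is only illustrative:

```javascript
// \w without the `u` and `i` flags only matches [A-Za-z0-9_].
const isAsciiWordChar = (ch) => /^\w$/.test(ch);

console.log(isAsciiWordChar("a")); // true
console.log(isAsciiWordChar("æ")); // false

// Because æ is not a word character, \b matches on both sides of it,
// so splitting on \b breaks the word apart.
const pieces = "læser".split(/\b/);
console.log(pieces); // [ 'l', 'æ', 'ser' ]
```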

For more information see http://www.regular-expressions.info/wordboundaries.html.

Upvotes: 0
