Reputation: 1097
I'm using the following regexp to split a sentence into an array of words:
/\b(?![\s.,:;'"])/
It works perfectly for ASCII-only sentences, but fails on the following sentence:
læseWEB læser teksten på dit website op.
I'm expecting
['læseWEB ', 'læser ', 'teksten ', 'på ', 'dit ', 'website ', 'op.'].
But I'm getting
['l', 'æ', 'se', 'WEB', 'l', 'æ', 'ser', 'teksten', 'p', 'å', 'dit','website', 'op']
I know JavaScript has issues with Unicode handling.
I was going to use the XRegExp JavaScript library, but I can't find the exact solution I'm looking for.
Upvotes: 0
Views: 406
Reputation: 56829
The definition of \b in JavaScript is based on the definition of \w, which is [A-Za-z0-9_] (only covers ASCII characters).
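To see the effect, here is a small sketch (the names isWordChar and pieces are only for illustration) that checks a few characters against \w and then splits one of the words from the question on \b:

```javascript
// \w is ASCII-only, so "æ" and "å" do not count as word characters.
const asciiWord = /\w/;
const isWordChar = (ch) => asciiWord.test(ch);

console.log(isWordChar("l")); // true
console.log(isWordChar("æ")); // false

// As a result, \b reports a boundary on both sides of "æ".
const pieces = "læser".split(/\b/);
console.log(pieces); // [ 'l', 'æ', 'ser' ]
```

This is the same fragmentation the question shows, reduced to a single word.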
If you use XRegExp with the Unicode Categories and Unicode Properties add-ons, you can match (instead of splitting) the string with the following code:
XRegExp.matchChain("læseWEB læser teksten på dit website op.", [XRegExp("[\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}]+", "g")])
>>> [ "læseWEB", "læser", "teksten", "på", "dit", "website", "op" ]
[\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}] is an incomplete emulation of a word character as suggested in Annex C of UTS #18 (Unicode Regular Expressions); it leaves out \p{Join_Control}. However, it should work for most purposes: it works even if the text uses combining marks instead of a single precomposed character.
If you don't want to load an extra library, you can look through the XRegExp source and pull out the lists of code points to build your own RegExp.
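Another option without a library, assuming your target engines implement ES2018 Unicode property escapes (this is an assumption about your environment, not something XRegExp provides), is to write the same character class in a native RegExp with the u flag. A minimal sketch, with wordPattern and words as illustrative names:

```javascript
// Assumption: the engine supports ES2018 Unicode property escapes (requires the "u" flag).
const wordPattern = /[\p{Alphabetic}\p{Nd}\p{Pc}\p{M}]+/gu;

const sentence = "læseWEB læser teksten på dit website op.";

// match() with a global regex returns all matches, or null when there are none.
const words = sentence.match(wordPattern) || [];
console.log(words);
// [ 'læseWEB', 'læser', 'teksten', 'på', 'dit', 'website', 'op' ]
```

Like the XRegExp version, this matches the words rather than splitting, so whitespace and punctuation are dropped.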
Upvotes: 1
Reputation: 11132
\b is a word boundary; it matches a location in a string that has a "word character" (a character matching [0-9_a-zA-Z]) on one side and a non-word character ([^0-9_a-zA-Z]) on the other side. æ, å and other characters like them are technically not word characters according to the regex engine, so a boundary is found next to them even in the middle of a word.
For more information see http://www.regular-expressions.info/wordboundaries.html.
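If you want to keep splitting, one possible workaround is to emulate a Unicode-aware "start of word" position with lookarounds instead of \b. This is only a sketch, assuming an engine with ES2018 lookbehind and Unicode property escapes; the names W, wordStart and parts are illustrative, and the character class is a rough approximation of a word character, not an exact one:

```javascript
// Assumption: ES2018 lookbehind and Unicode property escapes are available.
// Rough approximation of a Unicode word character.
const W = "[\\p{Alphabetic}\\p{Nd}\\p{Pc}\\p{M}]";

// Zero-width position where a word starts:
// no word character before it, and a word character right after it.
const wordStart = new RegExp(`(?<!${W})(?=${W})`, "u");

const parts = "læseWEB læser teksten på dit website op.".split(wordStart);
console.log(parts);
// [ 'læseWEB ', 'læser ', 'teksten ', 'på ', 'dit ', 'website ', 'op.' ]
```

Because the split only happens where a word begins, trailing spaces and punctuation stay attached to the preceding word, which is the shape the question asks for.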
Upvotes: 0