mpen

Reputation: 283043

How to match high-value unicode characters using a regex?

Specifically, I want to match the range [#x10000-#xEFFFF]. AFAIK, the \u escape sequences only accept 4 hex digits, not 5. Is there a way to match higher values?

Upvotes: 2

Views: 763

Answers (3)

Scott Wager

Reputation: 868

Code points that need 5 hex digits are stored as surrogate pairs. Use the ES6 'u' (unicode) flag to create a surrogate-pair-aware regex.

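// incorrect: without the u flag the class is read as four separate surrogate code units
'𝒳'.match(/[𝒳𝒴]/)

// correct: with the u flag each astral character is treated as one code point
'𝒳'.match(/[𝒳𝒴]/u)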

Upvotes: 2

Ted Hopp

Reputation: 234847

Internally, JavaScript treats strings as sequences of 16-bit code units (effectively UCS-2), so a single \u escape is limited to the Basic Multilingual Plane (BMP). For higher-range characters, you will have to use surrogate pairs. For instance, to find U+13FFA, you can match \uD80F\uDFFA.

More details can be found here.
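For reference, the pair for U+13FFA can be derived with the standard UTF-16 arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A rough sketch; toSurrogatePair is a hypothetical helper name, not a built-in:

```javascript
// toSurrogatePair is a hypothetical helper for illustration, not a standard API.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('expected a code point in U+10000..U+10FFFF');
  }
  const offset = codePoint - 0x10000;        // 20-bit value
  const highUnit = 0xD800 + (offset >> 10);  // top 10 bits
  const lowUnit = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [highUnit, lowUnit];
}

const [high, low] = toSurrogatePair(0x13FFA);
console.log(high.toString(16), low.toString(16)); // d80f dffa

// The resulting pair is what the regex has to match, one code unit at a time.
const pairRegex = /\uD80F\uDFFA/;
console.log(pairRegex.test(String.fromCharCode(high, low))); // true
```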

Unfortunately, this doesn't work well within character classes in a regex. With BMP characters, you can do things like /[a-z]/. You can't do that with higher-range characters because JavaScript doesn't understand that surrogate pairs should be treated as a unit. You may be able to hunt around for third-party libraries that deal with this. Sadly, I don't know of any to recommend. This one might be worth a look. I've never used it, so I cannot attest to its quality.
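For the specific range in the question, one possible workaround that needs no library is to use two character classes side by side, one per surrogate half. This is only a sketch and it relies on the range being nicely aligned: U+10000 encodes as D800 DC00 and U+EFFFF as DB7F DFFF, so every high surrogate from D800 to DB7F may be followed by any low surrogate. The name astralPairs is made up for the example:

```javascript
// Assumption: the target range is U+10000..U+EFFFF, which maps exactly onto
// high surrogates D800..DB7F followed by any low surrogate DC00..DFFF.
// astralPairs is a hypothetical name chosen for this example.
const astralPairs = /[\uD800-\uDB7F][\uDC00-\uDFFF]/;

console.log(astralPairs.test('\uD80F\uDFFA')); // true: U+13FFA
console.log(astralPairs.test('\uDB80\uDC00')); // false: U+F0000 is past the range
console.log(astralPairs.test('abc'));          // false: BMP characters only
```

A range that does not start and end on a surrogate boundary would need extra alternatives for the partial first and last blocks.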

P.S. You may find this shim useful for dealing with higher-order characters generally.

Upvotes: 4

mariozski

Reputation: 1134

Maybe something like this?

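// Matches the textual notation "#x10000" through "#xEFFFF", not the characters themselves.
var regex = /#x[1-9a-eA-E][0-9a-fA-F]{4}/; // no g flag, so test() keeps no lastIndex state between calls

console.log(regex.test("#x03FFA")); // false
console.log(regex.test("#x13FFA")); // true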

http://jsbin.com/awidew/1

mz

Upvotes: 0
