Adam Halasz

Reputation: 58301

Regex: Disable Symbols

Is there any way to disallow all symbols, punctuation, block elements, geometric shapes, and dingbats such as these:

✁ ✂ ✃ ✄ ✆ ✇ ✈ ✉ ✌ ✍ ✎ ✏ ✐ ✑ ✒ ✓ ✔ ✕ ⟻ ⟼ ⟽ ⟾ ⟿ ⟻ ⟼ ⟽ ⟾ ⟿ ▚ ▛ ▜ ▝ ▞ ▟

without listing all of them in the regular expression pattern, while still allowing all other normal language characters, such as Chinese, Arabic, etc., like these:

文化中国 الجزيرة نت

?

I'm building a JavaScript validation function, and my real problem is that I can't use:

[a-zA-Z0-9] 

Because this excludes a lot of languages too, not just the symbols.

Upvotes: 0

Views: 1869

Answers (5)

JasonTrue

Reputation: 19609

This depends on your regex dialect. Unfortunately, most existing JavaScript engines probably don't support Unicode character classes.

In regex engines such as the one in (recent) Perl or .NET, Unicode character classes can be referenced.

\p{L}: any kind of letter from any language.
\p{N}: any number symbol from any language (including, as I recall, the Indian, Arabic, and CJK number glyphs).
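For readers on a newer runtime: ECMAScript 2018 added these property escapes to JavaScript regular expressions when the `u` flag is set, which postdates the caveat above. A minimal sketch, assuming such an engine is available; the name `isPlainText` and the decision to also allow whitespace are illustrative assumptions, not part of the original answer:

```javascript
// Assumes an engine with ES2018 Unicode property escapes (the `u` flag is required).
// \p{L} = letters, \p{N} = numbers, \s = whitespace.
const PLAIN_TEXT = /^[\p{L}\p{N}\s]+$/u;

// Hypothetical helper: true only if every character is a letter, a number, or whitespace.
function isPlainText(input) {
  return PLAIN_TEXT.test(input);
}

console.log(isPlainText("文化中国 الجزيرة نت")); // true
console.log(isPlainText("hello ✂"));            // false
```

Without the `u` flag, `\p{L}` is not treated as a property escape, so the flag is not optional here.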

Because Unicode supports both composed and decomposed glyphs, you may run into some complexity: if the input contains decomposed forms, your pattern might accidentally exclude diacritic marks, and you may need to explicitly allow characters of the type Mark (\p{M}). You can mitigate this somewhat by, if I recall correctly, normalizing the string to Normalization Form KC (NFKC) first, although that only helps for characters that have a composed form. In environments that support Unicode well, there's usually a function that lets you normalize Unicode strings fairly easily (true in Java and .NET, at least).
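To make the composed/decomposed issue concrete, here is a small sketch. It assumes an engine with `String.prototype.normalize` (ES2015) and Unicode property escapes (ES2018), and it uses NFC rather than NFKC because only canonical composition is needed for this example; the variable names are illustrative:

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form of "é").
const decomposed = "e\u0301";
// NFC composes the pair into the single code point U+00E9.
const composed = decomposed.normalize("NFC");

const lettersOnly = /^\p{L}+$/u;
const lettersAndMarks = /^[\p{L}\p{M}]+$/u;

console.log(lettersOnly.test(decomposed));     // false: U+0301 is a Mark, not a Letter
console.log(lettersOnly.test(composed));       // true
console.log(lettersAndMarks.test(decomposed)); // true

// Normalization cannot compose everything: "q" + acute has no precomposed form,
// so the mark survives and \p{M} still has to be allowed.
console.log("q\u0301".normalize("NFC").length); // 2
```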

Edited to add: If you've started down this path, or have considered it, in order to regain some sanity, you may want to experiment with the Unicode Plugin for XRegExp (which will require you to take a dependency on XRegExp).

Upvotes: 2

Greg Hewgill

Reputation: 992847

The Unicode standard divides up all the possible characters into code charts. Each code chart contains related characters. If you want to exclude (or include) only certain classes of characters, you will have to make a suitable list of exclusions (or inclusions). Unicode is big, so this might be a lot of work.

Upvotes: 5

Christoph

Reputation: 169543

Take a look at the Unicode Planes. You probably want to exclude everything but planes 0 and 2. After that, it gets ugly as you'll have to exclude a lot of plane 0 on a case-by-case basis.
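A rough sketch of the plane check, assuming an ES2015 engine (for `codePointAt` and code-point iteration); `planeOf` and `onlyPlanes0And2` are hypothetical helper names:

```javascript
// Each plane holds 0x10000 (65,536) code points, so the plane is the quotient.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

// Hypothetical helper: true if every code point lies in plane 0 or plane 2.
function onlyPlanes0And2(input) {
  // for...of iterates by code point, so surrogate pairs are not split.
  for (const ch of input) {
    const plane = planeOf(ch.codePointAt(0));
    if (plane !== 0 && plane !== 2) return false;
  }
  return true;
}

console.log(onlyPlanes0And2("文化中国"));   // true  (plane 0)
console.log(onlyPlanes0And2("\u{20000}")); // true  (plane 2)
console.log(onlyPlanes0And2("\u{1F600}")); // false (plane 1)
console.log(onlyPlanes0And2("✂"));         // true  (U+2702 is still in plane 0)
```

The last line shows why the plane check alone is not enough: dingbats such as ✂ live in plane 0 and need the case-by-case exclusions mentioned above.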

Upvotes: 1

Kobi

Reputation: 138007

JavaScript regular expressions do not have native Unicode support. An alternative is to validate (or sanitize) the string on the server side, or to use a non-native regex library. While I've never used it, XRegExp is such a library, and it has a Unicode Plugin.

Upvotes: 1

Peter Bailey

Reputation: 105878

Not really.

JavaScript doesn't support Unicode Character Properties. The closest you'll get is excluding ranges by Unicode code point as Greg Hewgill suggested.

For example, to match all of the characters in the blocks from Arrows (U+2190) through Block Elements (U+259F), which include the Mathematical Operators:

/[\u2190-\u259F]/
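Extending the range above to the sample characters in the question might look like the sketch below. It assumes the contiguous blocks from Arrows (U+2190) through Supplemental Arrows-A (U+27FF) are the ones to reject; `containsSymbols` is a hypothetical name, and the list is not exhaustive (ASCII punctuation and other symbol blocks are not covered):

```javascript
// Contiguous blocks from Arrows (U+2190) through Supplemental Arrows-A (U+27FF),
// which include Box Drawing, Block Elements, Geometric Shapes,
// Miscellaneous Symbols, and Dingbats.
const SYMBOLS = /[\u2190-\u27FF]/;

// Hypothetical helper: true if the input contains any character in the range.
function containsSymbols(input) {
  return SYMBOLS.test(input);
}

console.log(containsSymbols("文化中国 الجزيرة نت")); // false
console.log(containsSymbols("✂"));                  // true (U+2702, Dingbats)
console.log(containsSymbols("⟻"));                  // true (U+27FB, Supplemental Arrows-A)
console.log(containsSymbols("▚"));                  // true (U+259A, Block Elements)
```

Because every code point in this range sits in the Basic Multilingual Plane, the pattern works with plain `\uXXXX` escapes and needs no `u` flag.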

Upvotes: 2
