mtmacdonald

Reputation: 15070

Match alphanumeric characters, including latin unicode

I have a working regex that matches ASCII alphanumeric characters:

 string pattern = "^[a-zA-Z0-9]+$";
 Match match = Regex.Match(input, pattern);
 if (match.Success)
 {
   ...

I want to extend this to also accept all Latin characters (e.g. å, Ø).

I've read about Unicode scripts, and I've tried this:

 string pattern = "^[{Latin}0-9]+$";

But it doesn't match the inputs I expect. How do I match Latin Unicode characters, using Unicode scripts or an alternative method?

Upvotes: 3

Views: 2289

Answers (3)

revo

Reputation: 48711

The .NET regex engine does not support Unicode scripts, but it does support Unicode named blocks. You can match the main Latin blocks with the following regex:

^[\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{IsLatinExtended-A}\p{IsLatinExtended-B}0-9]+$
  • \p{IsBasicLatin}: U+0000–U+007F
  • \p{IsLatin-1Supplement}: U+0080–U+00FF
  • \p{IsLatinExtended-A}: U+0100–U+017F
  • \p{IsLatinExtended-B}: U+0180–U+024F

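or simply use ^[\u0000-\u024F0-9]+$ (the trailing 0-9 is redundant here, since the digits are already part of Basic Latin). Note that these blocks also contain control characters, whitespace, punctuation and symbols, so both patterns accept more than letters and digits.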

As @AnthonyFaull mentioned, you may also want to match \p{IsLatinExtendedAdditional}, the named block for U+1E00–U+1EFF, which contains 256 additional characters:

[ắẮằẰẵẴẳẲấẤầẦẫẪẩẨảẢạ ẠặẶậẬḁḀ ẚ ḃḂḅḄḇḆ ḉḈ ḋḊḑḐḍḌḓḒḏḎ ẟ ếẾềỀễỄểỂẽẼḝḜḗḖḕḔẻẺẹẸ ệỆḙḘḛḚ ḟḞ ḡḠ ḧḦḣḢḩḨḥḤḫḪẖ ḯḮỉỈịỊḭḬ ḱḰḳḲḵḴ ḷḶḹḸḽḼḻḺ ỻỺ ḿḾṁṀṃṂ ṅṄṇṆṋṊṉṈ ốỐồỒỗỖổỔṍṌṏṎṓṒṑṐỏỎớỚ ờỜỡỠởỞợỢọỌộỘ ṕṔṗṖ ṙṘṛṚṝṜṟṞ ṥṤṧṦṡṠṣṢṩṨẛ ẞ ẜ ẝ ẗṫṪṭṬṱṰṯṮ ṹṸṻṺủỦứỨừỪữỮửỬựỰụỤṳṲ ṷṶṵṴ ṽṼṿṾ ỽỼ ẃẂẁẀẘẅẄẇẆẉẈ ẍẌẋẊ ỳỲẙỹỸẏẎỷỶỵỴ ỿỾ ẑẐẓẒẕẔ]
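
A minimal sketch of how the block-based pattern above might be used from C#. The class name and the sample inputs are made up for illustration, and non-ASCII letters are written as \u escapes so that the source file's normalization form does not affect the result.

```csharp
using System;
using System.Text.RegularExpressions;

class LatinBlocksDemo
{
    // Named blocks covering U+0000–U+024F plus U+1E00–U+1EFF.
    const string Pattern =
        @"^[\p{IsBasicLatin}\p{IsLatin-1Supplement}\p{IsLatinExtended-A}" +
        @"\p{IsLatinExtended-B}\p{IsLatinExtendedAdditional}]+$";

    static void Main()
    {
        string[] inputs =
        {
            "\u00D8rsted",                          // Ørsted (Latin-1 Supplement)
            "Vi\u1EC7t",                            // Việt (Latin Extended Additional)
            "\u0395\u03BB\u03BB\u03AC\u03B4\u03B1", // Ελλάδα (Greek, outside the blocks)
            "a b!"                                  // space and '!' are in Basic Latin
        };

        foreach (string input in inputs)
        {
            Console.WriteLine(Regex.IsMatch(input, Pattern));
        }
    }
}
```

The Greek input should be the only one rejected. The last input is accepted because Basic Latin includes the space and punctuation, which is worth keeping in mind if only letters and digits are wanted.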

Upvotes: 5

Michael Schmidt

Reputation: 128

I would use Unicode ranges.

As described on Wikipedia (https://en.wikipedia.org/wiki/Latin_script_in_Unicode), I would use the letters of Latin-1 Supplement (00C0–00FF), Latin Extended-A (0100–017F) and Latin Extended-B (0180–024F), together with your pattern for ASCII alphanumeric characters.

string pattern = "^[a-zA-Z0-9\\u00C0-\\u024F]+$";
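
As a rough illustration, the range-based approach could be exercised as below. The class name and sample inputs are hypothetical. Note that the range separator inside the character class must be an ASCII hyphen; an en dash would be treated as a literal character rather than a range.

```csharp
using System;
using System.Text.RegularExpressions;

class LatinRangeDemo
{
    static void Main()
    {
        // ASCII letters and digits plus everything in U+00C0–U+024F.
        var latin = new Regex(@"^[a-zA-Z0-9\u00C0-\u024F]+$");

        Console.WriteLine(latin.IsMatch("\u00E5\u00D8abc123")); // åØabc123
        Console.WriteLine(latin.IsMatch("caf\u00E9 au lait"));  // contains spaces
        Console.WriteLine(latin.IsMatch("3\u00D74"));           // 3×4, U+00D7 lies inside the range
    }
}
```

The first and third inputs should match and the second should not. The third one shows a limitation of a plain range: U+00C0–U+024F also contains the symbols × (U+00D7) and ÷ (U+00F7).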

Upvotes: 1

Stephane Janicaud

Reputation: 3627

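Use ^[\p{L}\s]+$ to match any Unicode letter (in any script, not only Latin) plus whitespace.

Or ^[\w\u00c0-\u017e]+$ to match any word character plus the characters from U+00C0 to U+017E (use charmap to find the character range you need). Note that in .NET, \w already matches Unicode letters and digits unless RegexOptions.ECMAScript is set.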

Sample on regex101
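
A small sketch of the \p{L} approach in C#, adapted to the question's letters-and-digits requirement. The pattern ^[\p{L}0-9]+$ is an assumption made for this example (whitespace dropped, digits added), and the class name and inputs are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

class UnicodeLetterDemo
{
    static void Main()
    {
        // \p{L} is the Unicode "letter" general category, across all scripts.
        var lettersAndDigits = new Regex(@"^[\p{L}0-9]+$");

        Console.WriteLine(lettersAndDigits.IsMatch("\u00E5\u00D8abc123"));                   // åØabc123
        Console.WriteLine(lettersAndDigits.IsMatch("\u0395\u03BB\u03BB\u03AC\u03B4\u03B1")); // Ελλάδα (Greek)
        Console.WriteLine(lettersAndDigits.IsMatch("a_b"));                                  // underscore is not a letter
    }
}
```

The first two inputs should match and the third should not. Because \p{L} is not limited to Latin, this is broader than what the question asks for, which may or may not be acceptable.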

Upvotes: 2
