Luigi Saggese

Reputation: 5379

How to normalize fancy-looking unicode string in C#?

I receive text in this kind of style from a REST API, for example:

But this is not italic, bold, or underlined, since the type is a plain string. This kind of text makes my Regex ^[a-zA-Z0-9._]*$ fail.

I would like to normalize the received string into a standard one so that my Regex still matches.

Upvotes: 25

Views: 2826

Answers (1)

VLRoyrenn

Reputation: 786

You can use Unicode Compatibility normalization forms, which use Unicode's own (lossy) character mappings to transform letter-like characters (among other things) to their simplified equivalents.

In Python, for instance:

>>> from unicodedata import normalize
>>> normalize('NFKD','𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰')
'How to remove this font from a string'

# EDIT: this one comes back unchanged (Cyrillic/Greek lookalikes have no compatibility mapping)
>>> normalize('NFKD','нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?')
'нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?'


Note that this only applies to stylistic variants of characters (superscripts, blackletter, fullwidth forms, etc.), so your third example, which uses non-Latin characters that merely look like Latin letters, can't be decomposed to ASCII.
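To see which characters fall into which group, one option is to look at the decomposition mapping that Unicode records for each character. The sketch below uses Python's `unicodedata.decomposition`; the helper name `compat_mapping` and the sample characters are only illustrative choices.

```python
import unicodedata

def compat_mapping(ch):
    """Return the raw Unicode decomposition mapping for one character ('' if it has none)."""
    return unicodedata.decomposition(ch)

samples = [
    "\U0001D4D7",  # MATHEMATICAL BOLD SCRIPT CAPITAL H (the fancy "H" above)
    "\uFF21",      # FULLWIDTH LATIN CAPITAL LETTER A
    "\u00B2",      # SUPERSCRIPT TWO
    "\u043D",      # CYRILLIC SMALL LETTER EN (looks like a small-caps "H")
    "\u03C3",      # GREEK SMALL LETTER SIGMA (looks like "o")
]

for ch in samples:
    # Stylistic variants carry a tag such as <font>, <wide> or <super>;
    # genuine letters of another script have an empty mapping.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {compat_mapping(ch)!r}")
```

The first three samples report a tagged mapping to an ASCII character, which is what NFKD/NFKC apply; the Cyrillic and Greek letters report an empty mapping, which is why normalization leaves them alone.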

In C#, String.Normalize performs the same normalization (see its documentation):

string s1 = "𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰";
string s2 = s1.Normalize(NormalizationForm.FormKD);
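To tie this back to the Regex in the question, here is a minimal Python sketch that normalizes first and validates second; the same two steps should carry over to C# with String.Normalize followed by Regex.IsMatch. The names `fold_and_validate` and `to_bold_script` are hypothetical helpers, and `to_bold_script` exists only to build test input without pasting fancy characters.

```python
import re
import unicodedata

# The pattern from the question: ASCII letters, digits, dot and underscore only.
ALLOWED = re.compile(r"^[a-zA-Z0-9._]*$")

def fold_and_validate(text):
    """Apply NFKD, then report whether the folded text matches the pattern."""
    folded = unicodedata.normalize("NFKD", text)
    return folded, ALLOWED.fullmatch(folded) is not None

def to_bold_script(word):
    """Hypothetical test helper: map a-z onto the bold script letters starting at U+1D4EA."""
    return "".join(chr(0x1D4EA + ord(c) - ord("a")) for c in word)

print(fold_and_validate(to_bold_script("username")))  # ('username', True)

# Cyrillic/Greek lookalikes have no compatibility mapping, so they stay
# as they are and the pattern still rejects them.
print(fold_and_validate("\u0442\u043D\u03B9\u0455"))
```

Note that the pattern does not allow spaces, so the sentence used in the examples above would normalize correctly but still fail the Regex because of its spaces.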

Upvotes: 23
