How to normalize fancy-looking unicode string in C#?

Question

I receive from a REST API a text with this kind of style, for example

𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰?
𝐻𝑜𝓌 𝓉𝑜 𝓇𝑒𝓂𝑜𝓋𝑒 𝓉𝒽𝒾𝓈 𝒻𝑜𝓃𝓉 𝒻𝓇𝑜𝓂 𝒶 𝓈𝓉𝓇𝒾𝓃𝑔?
нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?

But this is not italic or bold or underlined since the type it's string. This kind of text make it failed my Regex ^[a-zA-Z0-9._]*$

I would like to normalize this string received in a standard one in order to make my Regex still valid.

VLRoyrenn · Accepted Answer

You can use Unicode Compatibility normalization forms, which use Unicode's own (lossy) character mappings to transform letter-like characters (among other things) to their simplified equivalents.

In python, for instance:

>>> from unicodedata import normalize
>>> normalize('NFKD','𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰')
'How to remove this font from a string'

# EDIT: This one wouldn't work
>>> normalize('NFKD','нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?')
'нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?'

Interactive example here.

Note that this only applies to stylistic forms (superscripts, blackletter, fill-width, etc.), so your third example, which uses non-latin characters, can't be decomposed to ASCII.

For C# here's the documentation for String.Normalize, which does just that:

string s1 = "𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰";
string s2 = s1.Normalize(NormalizationForm.FormKD);

How to normalize fancy-looking unicode string in C#?

Answers (1)

Related Questions