Reputation: 5379
I receive text from a REST API that is styled like this, for example:
𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰?
𝐻𝑜𝓌 𝓉𝑜 𝓇𝑒𝓂𝑜𝓋𝑒 𝓉𝒽𝒾𝓈 𝒻𝑜𝓃𝓉 𝒻𝓇𝑜𝓂 𝒶 𝓈𝓉𝓇𝒾𝓃𝑔?
нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?
This is not italic, bold, or underlined formatting, since the value is a plain string.
This kind of text makes my regex ^[a-zA-Z0-9._]*$ fail.
I would like to normalize the received string into a standard form so that it passes my regex.
Upvotes: 25
Views: 2826
Reputation: 786
You can use Unicode Compatibility normalization forms, which use Unicode's own (lossy) character mappings to transform letter-like characters (among other things) to their simplified equivalents.
In Python, for instance:
>>> from unicodedata import normalize
>>> normalize('NFKD','𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰')
'How to remove this font from a string'
# EDIT: This one doesn't work; the string comes back unchanged
>>> normalize('NFKD','нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?')
'нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?'
Note that this only applies to stylistic forms (superscripts, blackletter, full-width, etc.), so your third example, which uses non-Latin characters (Cyrillic and Greek letters that merely resemble Latin ones), can't be decomposed to ASCII.
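To tie this back to the regex in the question, here is a minimal sketch that normalizes with NFKD and then reports which characters still fall outside the allowed set. The function name `normalize_and_check` is hypothetical, and the pattern is the question's character class applied with `fullmatch` instead of `^...$` anchors.

```python
import re
from unicodedata import normalize

# Same character class as the question's regex; fullmatch() replaces
# the ^ and $ anchors.
ALLOWED = re.compile(r'[a-zA-Z0-9._]*')

def normalize_and_check(text):
    """Hypothetical helper: apply NFKD, then list the characters
    that still do not match the allowed character class."""
    folded = normalize('NFKD', text)
    leftover = [ch for ch in folded if not ALLOWED.fullmatch(ch)]
    return folded, leftover

# Script-styled letters fold to plain ASCII.
print(normalize_and_check('𝓗𝓸𝔀'))   # ('How', [])

# Cyrillic/Greek look-alikes have no compatibility mapping and are
# returned unchanged, so they show up in the leftover list.
print(normalize_and_check('нσω'))
```

An empty leftover list means the normalized string passes the pattern. Keep in mind that NFKD also splits accented letters into a base letter plus a combining mark (for example, "é" becomes "e" followed by U+0301), and that combining mark will still fail this pattern unless it is removed separately.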
For C#, see the documentation for String.Normalize, which does just that:
// NormalizationForm is defined in the System.Text namespace.
string s1 = "𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰";
string s2 = s1.Normalize(NormalizationForm.FormKD);
Upvotes: 23