Working with Unicode Blocks in Regex

Question

I am trying to add a feature that works with certain Unicode groups in a string. I found this question, which suggests the following solution. It removes every character that falls outside the stated range:

s = Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);

This works fine.

In my research, though, I came across Unicode blocks, which I find far more readable.

InBasic_Latin = U+0000–U+007F

More often, though, I saw recommendations to use the actual code points (\u0000-\u007F) rather than these blocks (InBasic_Latin). I can see the benefit of declaring a range explicitly when you need a subset of a block or a single code point, but when you want the entire grouping, the block name seems more readable and easier to work with programmatically.

So, generally, my question is: why would \u0000-\u007F be considered a better syntax than InBasic_Latin?

Tim Pietzcker · Accepted Answer

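It depends on your regex engine. Some (like .NET, Java, Perl) do support Unicode blocks, although the spelling of the block name varies by engine. In .NET, for example, the block is written with an Is prefix: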

if (Regex.IsMatch(subjectString, @"\p{IsBasicLatin}")) {
    // Successful match
}

Others (like JavaScript, PCRE, Python, Ruby, R, and most other engines) don't, so you need to spell out those code points manually or use an extension like Steve Levithan's XRegExp library for JavaScript.
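
To tie this back to the replacement in the question, here is a minimal sketch in C# that strips everything outside Basic Latin in two ways: once with the negated block (\P with a capital P) and once with the explicit range. The class name BasicLatinDemo, the two method names, and the sample string are made up for illustration. In .NET the two patterns should behave the same, so the choice comes down to readability versus portability to engines that lack block names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names here are illustrative only.
public static class BasicLatinDemo
{
    // .NET spells the block as IsBasicLatin; \P{...} (capital P) negates it,
    // so this removes every character outside U+0000-U+007F.
    public static string StripWithBlock(string s) =>
        Regex.Replace(s, @"\P{IsBasicLatin}", string.Empty);

    // The explicit range from the question, which works in most engines.
    public static string StripWithRange(string s) =>
        Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);

    public static void Main()
    {
        // "naive cafe" written with U+00EF and U+00E9 (precomposed accents).
        string input = "na\u00EFve caf\u00E9";

        Console.WriteLine(StripWithBlock(input)); // prints "nave caf"
        Console.WriteLine(StripWithRange(input)); // prints "nave caf"
    }
}
```

If the pattern may later be ported to an engine without block support, the explicit range is the safer choice; if the code stays in .NET, the block name documents the intent more directly.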
