Senthil Kumaran

Reputation: 56931

Testing whitespace using Regex with LOCALE and UNICODE flags in Python

I want to write a test script in Python that does the following:

  1. I give a string in a locale other than ASCII, one that has a different set of whitespace characters, and then use '\s' with the re.LOCALE flag to see the output.
  2. I would also like to do the complement: use '\S' and see the non-whitespace characters returned for that locale.

Now, how can I achieve that? Which locale should I choose to see a clear difference in output from ASCII?

# -*- Proper encoding -*-
import re

# Raw string, so that \s reaches the regex engine unchanged
pat = re.compile(r'\s*', re.LOCALE)
string = "string"  # Proper Replacement String?
result = pat.match(string)
print result.group(0)  # Python 2 print statement

I am using Ubuntu, and the following is the current locale of my shell:

$ locale
LANG=en_SG.UTF-8
LANGUAGE=en_SG:en
LC_CTYPE="en_SG.UTF-8"
LC_NUMERIC="en_SG.UTF-8"
LC_TIME="en_SG.UTF-8"
LC_COLLATE="en_SG.UTF-8"
LC_MONETARY="en_SG.UTF-8"
LC_MESSAGES="en_SG.UTF-8"
LC_PAPER="en_SG.UTF-8"
LC_NAME="en_SG.UTF-8"
LC_ADDRESS="en_SG.UTF-8"
LC_TELEPHONE="en_SG.UTF-8"
LC_MEASUREMENT="en_SG.UTF-8"
LC_IDENTIFICATION="en_SG.UTF-8"
LC_ALL=

BTW, I have little experience with Unicode- or locale-aware input/output (if that matters). All I know is that I can type Unicode letters using code points in the terminal.

Upvotes: 1

Views: 1037

Answers (1)

Senthil Kumaran

Reputation: 56931

Answering my own question after digging around the source code.

In the Python source file _sre.c, the definition of LOCALE space is this:

#define SRE_LOC_IS_SPACE(ch) (!((ch) & ~255) ? isspace((ch)) : 0)

The definition of the NON_SPACE category is simply the negation of space. That's it.

Now, given that definition, we see that for character values higher than 255 the check is not made at all, and the macro simply returns 0. For values up to 255, only the plain C isspace() is consulted when the LOCALE flag is set. So, in effect, the re.LOCALE flag has no extra effect on the matching of whitespace or non-whitespace characters.
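To make the macro's logic concrete, here is a minimal Python 3 sketch of it. The names loc_is_space, loc_is_not_space and C_LOCALE_SPACES are hypothetical helpers, not part of Python or of _sre.c, and the call to C's isspace() is replaced by a fixed set of the six ASCII whitespace bytes, which assumes the default "C" locale.

```python
# Assumption: stands in for C's isspace() under the default "C" locale,
# where only these six byte values count as whitespace.
C_LOCALE_SPACES = frozenset(b" \t\n\r\f\v")


def loc_is_space(ch: int) -> bool:
    """Hypothetical Python model of SRE_LOC_IS_SPACE(ch)."""
    if ch & ~255:
        # Some bit above the low 8 bits is set (ch > 255):
        # the macro never calls isspace() and yields 0.
        return False
    return ch in C_LOCALE_SPACES  # stand-in for isspace(ch)


def loc_is_not_space(ch: int) -> bool:
    """The NON_SPACE category is just the negation of the space check."""
    return not loc_is_space(ch)
```

Under these assumptions, loc_is_space(ord(" ")) is true, while a code point above 255 such as 0x2003 (EM SPACE) is rejected before any whitespace test is made.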

For Unicode, the logic is handled in unicodeobject.c, and as far as I can see it is just a superset of ASCII whitespace: all ASCII whitespace characters are Unicode whitespace characters too.
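The superset relationship can be checked with a short Python 3 sketch. It relies on str patterns being Unicode-aware by default in Python 3; the names ASCII_WHITESPACE, matches_unicode_space and all_ascii_match are made up for this illustration.

```python
import re

# The six ASCII whitespace characters.
ASCII_WHITESPACE = " \t\n\r\f\v"

# In Python 3, a str pattern uses Unicode matching rules by default.
unicode_space = re.compile(r"\s")


def matches_unicode_space(ch: str) -> bool:
    """Return True if the single character ch matches \\s in Unicode mode."""
    return unicode_space.fullmatch(ch) is not None


# Every ASCII whitespace character should also count as Unicode whitespace,
# both for the regex engine and for str.isspace().
all_ascii_match = all(
    matches_unicode_space(ch) and ch.isspace() for ch in ASCII_WHITESPACE
)
```

If the superset claim holds, all_ascii_match is True, and an ordinary letter such as "a" does not match.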

Given this, it is impossible to write a Python program that tests for a whitespace character exclusive to a locale or to Unicode, that is, one that excludes the ASCII whitespace characters.

Upvotes: 1
