Reputation: 11
I tried the following command: echo 'ひらが' | grep '[[:alnum:]]'
It matched the string, but my locale is set to English:
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=
Shouldn't [:alnum:] match only characters of the language that is set in the locale? What is happening here?
Upvotes: 1
Views: 209
Reputation: 56819
I'm going to post this as a partial answer, since I spent some time digging up this information and it's too long for a comment.
If you take a look at the locale definition files located in /usr/share/i18n/locales on a Linux installation, you can see that the definition of LC_CTYPE (which defines the classification of characters, as used by ctype.h in C and by the POSIX character classes) in en_US copies the definition from en_GB, and the LC_CTYPE definition in en_GB copies the definition from i18n, with minor additions.
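To follow that chain of copies yourself, you can print just the LC_CTYPE section of each file. This is a minimal sketch, assuming the glibc locale sources are installed under /usr/share/i18n/locales (some distributions ship them in a separate package); show_ctype is a helper name made up for this example.

```shell
# Hypothetical helper: print the LC_CTYPE section of a locale source file,
# i.e. everything from the "LC_CTYPE" line through "END LC_CTYPE".
show_ctype() {
    sed -n '/^LC_CTYPE/,/^END LC_CTYPE/p' "$1"
}

# Assumed location of the glibc locale sources; adjust if yours differs.
for name in en_US en_GB; do
    file="/usr/share/i18n/locales/$name"
    if [ -r "$file" ]; then
        echo "== $name =="
        show_ctype "$file"
    else
        echo "not found: $file"
    fi
done
```

The copy lines inside each printed section name the file that the definition is taken from.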
Looking at the file i18n, we find the bulk of the LC_CTYPE definition. We can see that alpha includes letters of all languages defined in Unicode, with the following comment briefly explaining the rationale:
% The "alpha" class of the "i18n" FDCC-set is reflecting
% the recommendations in TR 10176 annex A
alpha /
Annex A of ISO/IEC TR 10176 seems to recommend using an "extended repertoire for user-defined identifier", which is supposed "to improve understandability for programmers whose native language is not English", though I fail to see how that has anything to do with the definition of the alpha character class.
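Since alnum is the combination of alpha and digit, this would explain why [[:alnum:]] matches hiragana under en_US.UTF-8. One way to check that the locale's LC_CTYPE is what drives the match is to repeat the test under the C locale, where only ASCII letters and digits are classified as alnum. A rough sketch follows; matches_alnum is a helper name made up for this example, and it assumes the en_US.UTF-8 locale is installed (if it is not, that run typically falls back to C locale behavior).

```shell
# Hypothetical helper: succeeds if the string in $2 contains a character
# that [[:alnum:]] matches under the locale named in $1.
matches_alnum() {
    printf '%s\n' "$2" | LC_ALL="$1" grep -q '[[:alnum:]]'
}

# Compare the C locale with the locale from the question.
for loc in C en_US.UTF-8; do
    if matches_alnum "$loc" 'ひらが'; then
        echo "$loc: match"
    else
        echo "$loc: no match"
    fi
done
```

Setting LC_ALL for the single grep invocation overrides LANG and every other LC_* variable, so the rest of the shell session is unaffected.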
Upvotes: 1