Manol

Reputation: 11

grep POSIX regex match Hindi and Japanese

I tried the following: echo 'ひらが' | grep '[[:alnum:]]', and it matched the string.

But my locale is set to English:

LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=en_US.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=en_US.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=en_US.UTF-8
LC_NAME=en_US.UTF-8
LC_ADDRESS=en_US.UTF-8
LC_TELEPHONE=en_US.UTF-8
LC_MEASUREMENT=en_US.UTF-8
LC_IDENTIFICATION=en_US.UTF-8
LC_ALL=

Shouldn't [:alnum:] match only the language that is set in the locale? What is happening here?

Upvotes: 1

Views: 209

Answers (1)

nhahtdh

Reputation: 56819

I'm going to post this as a partial answer, since I spent some time digging up this information and it's too long for a comment.

If you look at the locale definition files located in /usr/share/i18n/locales on a Linux installation, you will find that the LC_CTYPE definition in en_US copies the definition from en_GB, and the LC_CTYPE definition in en_GB in turn copies the definition from i18n, with minor additions. (LC_CTYPE defines the classification of characters, as used by ctype.h in C and by the POSIX character classes.)
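To trace that copy chain yourself, something like the following sketch may help. It assumes the locale sources are installed as plain text under /usr/share/i18n/locales (some distributions compress them or do not ship them at all), and ctype_copies is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: print the "copy" directives that appear inside
# the LC_CTYPE section of a locale source file.
ctype_copies() {
    awk '/^LC_CTYPE/     { in_ctype = 1; next }
         /^END LC_CTYPE/ { in_ctype = 0 }
         in_ctype && /^copy/ { print FILENAME ": " $0 }' "$1"
}

# Assumption: plain-text locale sources live here, as described above.
dir=/usr/share/i18n/locales
for name in en_US en_GB; do
    if [ -r "$dir/$name" ]; then
        ctype_copies "$dir/$name"
    fi
done
```

If the files are present, each printed line names the file and the definition it copies from; if they are missing, the loop prints nothing.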

The bulk of the LC_CTYPE definition is in the file i18n. There, alpha includes letters of all languages defined in Unicode, and since alnum is the union of alpha and digit, [[:alnum:]] matches the Japanese text as well. The following comment briefly explains the rationale:

% The "alpha" class of the "i18n" FDCC-set is reflecting
% the recommendations in TR 10176 annex A
alpha /

Annex A of ISO/IEC TR 10176 seems to recommend using an "extended repertoire for user-defined identifier", which is supposed "to improve understandability for programmers whose native language is not English", though I fail to see what that has to do with the definition of the alpha character class.
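As a rough way to see the effect of the locale's LC_CTYPE on the original command, the sketch below runs the same grep under the C locale and under en_US.UTF-8. It assumes a grep that honours LC_ALL and that en_US.UTF-8 is installed; if it is not, the second run falls back to C locale behaviour. The variable names are only illustrative.

```shell
# C locale: only ASCII letters and digits are alphanumeric, so the
# hiragana string is not matched and grep exits with status 1.
c_status=0
printf '%s\n' 'ひらが' | LC_ALL=C grep -q '[[:alnum:]]' || c_status=$?
echo "C locale exit status: $c_status"

# en_US.UTF-8 (assumed to be installed): alpha comes from the i18n
# definition, so the same pattern is expected to match (status 0).
utf8_status=0
printf '%s\n' 'ひらが' | LC_ALL=en_US.UTF-8 grep -q '[[:alnum:]]' || utf8_status=$?
echo "en_US.UTF-8 exit status: $utf8_status"
```

An exit status of 0 means grep found a match, and 1 means it found none.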

Upvotes: 1
