user8179

Reputation: 61

Regular expression [A-Za-z] seems to not include letter W and w

For some reason (maybe something isn't quite right in my system or in my brain), the regular expression "[A-Z]" doesn't seem to recognise the letter "W" and "[a-z]" doesn't seem to recognise the letter "w". Example:

for x in A a B b C c D d E e F f G g H h I i J j K k L l M m N n O o P p Q q R r S s T t U u V v W w X x Y y Z z; do echo $x | egrep "[A-Za-z]"; done

My output is: A a B b C c D d E e F f G g H h I i J j K k L l M m N n O o P p Q q R r S s T t U u V v X x Y y Z z

As you can see, the letters "W" and "w" are both missing. Am I the only one? What could possibly cause this? If it's a bug, where do I report it? This happens in bash and zsh, and it happens in sed and egrep (and possibly more, I only tested those two), so the problem seems to be about regular expressions in general… :o So… what is going on??

Edit: Someone asked for my locale, so here it is.

$ locale
LANG=sv_SE.utf8
LC_CTYPE="sv_SE.utf8"
LC_NUMERIC=sv_SE.UTF-8
LC_TIME=sv_SE.UTF-8
LC_COLLATE="sv_SE.utf8"
LC_MONETARY=sv_SE.UTF-8
LC_MESSAGES="sv_SE.utf8"
LC_PAPER=sv_SE.UTF-8
LC_NAME=sv_SE.UTF-8
LC_ADDRESS=sv_SE.UTF-8
LC_TELEPHONE=sv_SE.UTF-8
LC_MEASUREMENT=sv_SE.UTF-8
LC_IDENTIFICATION=sv_SE.UTF-8
LC_ALL=

If this is the problem, then I guess whatever decides what sv_SE.UTF-8 is, is wrong, because the letter "w" was added to the Swedish alphabet in 2006. Also, if the A-Z interval is dependent on the current locale, shouldn't [A-Ö] work for the whole Swedish alphabet when the locale is set to Swedish? It doesn't; it gives an error message. However, [[:alpha:]] seems to include all Swedish letters, so I guess I'm happy with that.

Upvotes: 3

Views: 1155

Answers (2)

clawster

Reputation: 1

This is NOT recommended as a "final solution" but might help someone somehow...

I found out that editing

/usr/share/i18n/locales/sv_SE

and commenting out the last two lines in this section resolved the issue.

% The letter w is normally not present in the Swedish alphabet. It
% exists in some names in Swedish and foreign words, but is accounted
% for as a variant of 'v'.  Words and names with 'w' are in Swedish
% ordered alphabetically among the words and names with 'v'. If two
% words or names are only to be distinguished by 'v' or % 'w', 'v' is
% placed before 'w'.

% &v<<<V<<w<<<W
%<U0057> <S0076>;"<BASE><VRNT1>";"<CAP><MIN>";IGNORE % W
%<U0077> <S0076>;"<BASE><VRNT1>";"<MIN><MIN>";IGNORE % w

and after that regenerating the locale

sudo locale-gen

made things a little better...

Upvotes: 0

rici

Reputation: 241791

Technically speaking, using range expressions such as [a-z] in a Posix regular expression (as with the grep utility) only has specified behaviour in the Posix (C) locale. That means that you really cannot reliably use range expressions in the sv_SE locale (or any other internationalised locale). You can, however, reliably use character classes, such as [[:lower:]], [[:alpha:]], [[:alnum:]], and so on, and that is normally what you should do.
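As a rough illustration of the two reliable options (a sketch, not something from the bug report): either force the Posix (C) locale for the one command that uses a range, or use a character class. The sample letters fed to grep are arbitrary.

```shell
# Option 1: force the C locale for this command only, so that [A-Za-z]
# has specified behaviour and covers exactly the ASCII letters.
matched=$(printf '%s\n' V v W w X x | LC_ALL=C grep -c '[A-Za-z]')
echo "lines matched by the range in the C locale: $matched"

# Option 2: use a character class, which depends on LC_CTYPE rather than
# on the collation order of the current locale.
printf '%s\n' W w | grep '[[:alpha:]]'
```

Setting LC_ALL on the command itself, rather than exporting it, keeps the rest of the session in the Swedish locale.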

Having said that, I believe that what you are experiencing is indeed a bug in glibc introduced in v2.28, since previous versions of the sv_SE locale correctly placed w in lower-case ranges and W in upper-case ranges. I think the change does not match user expectations, since it will break regex range expressions which previously worked as expected despite having unspecified behaviour.

The problem was reported as a glibc bug about a month ago, and almost immediately closed for lack of documentation; yesterday, I requested that it be reopened. (Update: that bug was requalified as a duplicate of another bug whose eventual solution can only be a comprehensive solution to the underlying design issue. In other words, the glibc team understand that there is a problem but don't hold your breath for a solution.)

I've put a possible replacement sv_SE locale definition file in this repository, in case it proves to be useful to someone. Please don't install it unless you are experiencing problems with the locale definition from glibc.

My excessively long comment in the bug report linked above tries to lay out the problem, which is more a problem of definition than implementation. The essential problem is that it is very difficult (if not impossible) to define a single-character collation order which is completely consistent with a whole-string comparison order. Reading between the lines in the Posix rationale document, it seems clear that a lot of people banged their heads against this particular brick wall without ever managing to come up with a practical portable proposal with implementation consensus. ("As noted above, efforts were made to resolve the differences, but no solution has been found that would be specific enough to allow for portable software while not invalidating existing implementations.")

A well-intentioned cleanup of the various locale definition files resulted in a change to the character ordering in the Swedish locale. It did not alter the string sortation order, so that V and W continue to be sorted as before (that is, as though they were variant spellings of the same letter rather than different letters), and it did not alter the CTYPE definitions, so W and w continue to be letters (and thus match [[:alpha:]]) as they were before. But it did (accidentally, I believe) alter the character order. Before, W followed V and w followed v, so that W matched [U-X] and w matched [u-x]. The change placed both characters after thorn (þ), which means they cannot match any range expression. (Regex range expressions are limited to single-byte codepoints.)
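To check whether a given system is affected, something like the following sketch may help. The function name missing_letters is made up for this example, and it assumes the locale passed in is actually installed (glibc falls back to the C locale otherwise, which would hide the problem).

```shell
# missing_letters LOCALE
# Print every ASCII letter that [A-Za-z] fails to match under LOCALE.
# (Hypothetical helper, not part of any standard tool.)
missing_letters() {
    for ch in A a B b C c D d E e F f G g H h I i J j K k L l M m \
              N n O o P p Q q R r S s T t U u V v W w X x Y y Z z; do
        # grep -q exits non-zero when the letter does not match the range.
        printf '%s\n' "$ch" | LC_ALL="$1" grep -q '[A-Za-z]' || printf '%s ' "$ch"
    done
    echo
}

missing_letters C            # prints an empty line: nothing is missing
missing_letters sv_SE.UTF-8  # result depends on the installed glibc locale data
```

If the second call prints letters, the installed locale definition places them outside the A-Z and a-z ranges.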


A previous question had been suggested as a duplicate of this question, but I removed the duplicate marker because that question focuses on the wisdom of using [a-z] and not on possible implementation errors, and also because it is about Perl regexes rather than Posix regexes. However, there is a lot of useful information in the answers.

Upvotes: 6
