jerry

Reputation: 355

Why are [a-z]{3} and [[:lower:]]{3} different in egrep?

Try these two commands:

egrep "^[a-z]{3}$" /usr/share/dict/words

egrep "^[[:lower:]]{3}$" /usr/share/dict/words

The first one returns both uppercase and lowercase words. The second one returns lowercase words only.

Upvotes: 1

Views: 260

Answers (2)

Asaph

Reputation: 162801

Are you sure? On my system (OS X Snow Leopard), both commands return exactly the same results: three-letter lowercase words only.

$ egrep "^[a-z]{3}$" /usr/share/dict/words | wc -l
    1134
$ egrep "^[[:lower:]]{3}$" /usr/share/dict/words | wc -l
    1134

$ egrep "^[[:lower:]]{3}$" /usr/share/dict/words | md5
0a66d5e78cfbe6f9f66d2d90b1053972
$ egrep "^[a-z]{3}$" /usr/share/dict/words | md5
0a66d5e78cfbe6f9f66d2d90b1053972

What system are you using? Perhaps try man egrep and look for a case-sensitivity option. The egrep that ships with OS X offers only the opposite: -i, --ignore-case, which ignores case distinctions.

Update:

I've also verified this on a CentOS Linux box:

$ egrep "^[a-z]{3}$" /usr/share/dict/words | wc -l
2044
$ egrep "^[[:lower:]]{3}$" /usr/share/dict/words | wc -l
2044
$ egrep "^[a-z]{3}$" /usr/share/dict/words | md5sum 
480fb21554f9f731adddb0d648157926  -
$ egrep "^[[:lower:]]{3}$" /usr/share/dict/words | md5sum 
480fb21554f9f731adddb0d648157926  -

Update #2:

Judging by your comments, you may be passing the -i or --ignore-case option to egrep. Turn that off to get only the lowercase results.
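If you are not passing -i explicitly, it may be coming from an alias or from the environment. The following is a rough sketch for checking that, not a definitive diagnosis; describe_cmd is a hypothetical helper name introduced here, and GREP_OPTIONS is only honored by older GNU grep releases.

```shell
# describe_cmd is a hypothetical helper: it reports how the shell resolves
# a command name (alias, function, builtin, or file on PATH).
describe_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        type "$1"
    else
        printf '%s: not found\n' "$1"
    fi
}

describe_cmd egrep
describe_cmd grep

# Older GNU grep releases also read default options from GREP_OPTIONS.
printf 'GREP_OPTIONS=%s\n' "${GREP_OPTIONS:-(unset)}"
```

Run this in the same interactive shell where you see the odd results, since aliases are usually defined only there.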

Upvotes: 1

paxdiablo

Reputation: 881643

It has to do with your locale setting. If you set LC_ALL to C, it should work as expected.

From the egrep manpage under Ubuntu 11.04:

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set.

For example, in the default C locale, [a-d] is equivalent to [abcd]. Many locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.

You can try the commands from the following transcript to confirm this:

pax$ egrep "^[a-z]{3}$" /usr/share/dict/words | head -n 5
AOL
Abe
Ada
Ala
Ali
pax$ LC_ALL=C egrep "^[a-z]{3}$" /usr/share/dict/words | head -n 5
ace
act
add
ado
ads
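If your dictionary file differs, the same effect can be reproduced on a tiny word list. This is a minimal sketch: the sample words and the variable names (words, c_matches, class_matches) are made up for illustration, and grep -E is used as the equivalent spelling of egrep.

```shell
# Build a tiny word list (hypothetical sample data, not the real dictionary).
words=$(mktemp)
printf '%s\n' Abe AOL ace act Zoe zip > "$words"

# In the C locale, [a-z] covers only the characters 'a' through 'z'.
c_matches=$(LC_ALL=C grep -E '^[a-z]{3}$' "$words")
printf 'C locale, [a-z]:\n%s\n' "$c_matches"

# [[:lower:]] names the character class directly instead of a collation range.
class_matches=$(LC_ALL=C grep -E '^[[:lower:]]{3}$' "$words")
printf 'C locale, [[:lower:]]:\n%s\n' "$class_matches"

# Same range under whatever locale is currently active. Whether capitalized
# words show up here depends on the locale and the grep version.
printf 'Current locale, [a-z]:\n'
grep -E '^[a-z]{3}$' "$words"

rm -f "$words"
```

With LC_ALL=C, both patterns should select the same lines; only the last command can vary between systems.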

Upvotes: 4
