Reputation: 3741

How to remove all lowercase characters from a string using AWK?

Please note that I need this answer in AWK.

How can I remove all lowercase characters from some awk variable? I tried calling gsub:

gsub(/[a-z]+/,"",varName);

Unfortunately, that removes the whole string, as if awk cannot tell the difference between lower and upper case. Is there some regex-fu I can use that I'm not aware of?

EDIT: Confirmed, awk does not see the difference between lowercase and uppercase characters.

Example 1 (using the letter f as the replacement so the results are easier to see):

varName="CHRFProtocol";
gsub(/[a-z]/,"f",varName);

Result: ffffffffffff

Example 2 (again using the letter f as the replacement so the results are easier to see):

varName="CHRFProtocol";
gsub(/[A-Z]/,"f",varName);

Result: ffffffffffff

Is this legitimate? What's going on?

Upvotes: 3

Answers (4)

Ed Morton

Reputation: 203684

You should just be using the POSIX character class [[:lower:]], not [a-z]:

gsub(/[[:lower:]]/,"",varName)

A range such as [a-z] depends on the locale's collation order, whereas [[:lower:]] matches exactly the characters the locale classifies as lowercase.
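As a minimal sketch of the call above, run on the sample value from the question. It assumes an awk that supports POSIX character classes (gawk does; some older mawk builds may not), and the shell variable name result is only illustrative:

```shell
# Strip lowercase letters using the POSIX character class.
# "CHRFProtocol" is the sample value from the question.
result=$(awk 'BEGIN {
    varName = "CHRFProtocol"
    gsub(/[[:lower:]]/, "", varName)   # delete every lowercase letter
    print varName
}')
echo "$result"
```

With a conforming awk this should leave only the uppercase letters, CHRFP.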

It seems like there's some confusion over when to use POSIX character classes vs when/how to set locale so:

1) Always use POSIX character classes when they exist for the character set you're interested in (e.g. [:digit:], [:lower:], [:punct:], etc.)

2) Otherwise, set LC_ALL=C IF you're OK with how that affects your other settings (e.g. comma vs period as the thousands separator)

3) Otherwise, set LC_COLLATE=C.
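Options 2 and 3 can be sketched as follows, for cases where a range like [a-z] has to be kept. The names strip_lower, with_lc_all and with_lc_collate are illustrative, not part of the question. The variable assignments apply only to the single awk command they prefix:

```shell
# The awk program under test: remove everything matched by the range [a-z].
strip_lower='BEGIN { v = "CHRFProtocol"; gsub(/[a-z]/, "", v); print v }'

# Option 2: override every locale category for this one command.
with_lc_all=$(LC_ALL=C awk "$strip_lower")

# Option 3: override only the collation order.
# LC_ALL, when set, takes precedence over LC_COLLATE, so it is unset
# inside the command substitution's subshell first.
with_lc_collate=$(unset LC_ALL; LC_COLLATE=C awk "$strip_lower")

echo "$with_lc_all $with_lc_collate"
```

In both cases the range is interpreted by character code, so only the lowercase ASCII letters should be removed.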

See http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html and http://pubs.opengroup.org/onlinepubs/009695399/utilities/awk.html for more info on character classes and locale variables.

Upvotes: 3

Mark Reed

Reputation: 95267

Your locale settings are getting in the way. Try this:

LC_ALL=C awk 'BEGIN { 
varName="CHRFProtocol";
gsub(/[a-z]/,"f",varName);
print(varName); }'

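GNU awk honors locale settings, and in many national locales on Linux a range such as [a-z] follows the locale's collation order (for example aAbBcC...) rather than the ASCII order, so it matches uppercase letters as well. Resetting the locale to C (=POSIX) for the duration of the awk command makes the range match only lowercase letters again.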

Upvotes: 5

Kent

Reputation: 195109

An example explains everything:

kent$  awk 'BEGIN{var="AaBbCcDDDdddEEEeee";print "before:"var;gsub(/[a-z]/,"",var);print "after:"var}' 
before:AaBbCcDDDdddEEEeee
after:ABCDDDEEE

Upvotes: 1

anubhava

Reputation: 785276

To remove all lowercase characters in awk, use:

gsub(/[a-z]+/, "", varName);

You're actually replacing one or more occurrences of lowercase letters with the literal string "f".

EDIT (after you corrected your question):

Note that if your varName contains only lowercase letters or is already empty, you will get an empty string in varName.
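A small sketch of that edge case, assuming the C locale so that [a-z] matches only lowercase ASCII letters. The variable names mixed, lower and out are illustrative:

```shell
# Compare a mixed-case value with an all-lowercase one.
# Brackets in the output make the empty string visible.
out=$(LC_ALL=C awk 'BEGIN {
    mixed = "CHRFProtocol"
    lower = "protocol"
    gsub(/[a-z]+/, "", mixed)   # uppercase letters survive
    gsub(/[a-z]+/, "", lower)   # nothing is left
    printf "[%s] [%s]\n", mixed, lower
}')
echo "$out"
```

The second pair of brackets comes out empty because every character in that value was lowercase.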

Upvotes: 1
