Reputation: 128

In bash/sed, how do you match on a lowercase letter followed by the SAME letter in uppercase?

I want to delete all instances of "aA", "bB" ... "zZ" from an input string.

e.g.

echo "foObar" |
sed -Ee 's/([a-z])\U\1//'

should output "fbar"

But the \U syntax only works in the replacement half of the sed expression; it is not interpreted in the pattern (matching) half.

I'm having difficulty converting the matched character to upper case to reuse in the matching clause.

If anyone could suggest a working regex which can be used in sed (or awk) that would be great.

Scripting solutions in pure shell are ok too (I'm trying to think of solving the problem this way).

Working PCRE (Perl-compatible regular expressions) are ok too but I have no idea how they work so it might be nice if you could provide an explanation to go with your answer.

Unfortunately, I don't have perl or python installed on the machine that I am working with.

Upvotes: 2

Answers (5)

Jay

Reputation: 3950

Note: This solution is (unsurprisingly) slow, based on OP's feedback:
"Unfortunately, due to the multiple passes - it makes it rather slow. "

If there is a character sequence¹ that you know won't ever appear in the input,
you could use a 3-stage replacement to accomplish this with sed:

echo 'foObar foobAr' | sed -E -e 's/([a-z])([A-Z])/KEYWORD\1\l\2/g' -e 's/KEYWORD(.)\1//g' -e 's/KEYWORD(.)(.)/\1\u\2/g'

gives you: fbar foobAr

Replacement stages explained:

1. Look for a lowercase letter followed by ANY uppercase letter and replace the pair with both letters in lowercase, prefixed with KEYWORD: foObar foobAr -> fKEYWORDoobar fooKEYWORDbar
2. Remove KEYWORD followed by two identical characters (both are lowercase now, so the back-reference works): fKEYWORDoobar fooKEYWORDbar -> fbar fooKEYWORDbar
3. Strip the remaining² KEYWORD from the output and convert the second character after it back to its original, uppercase version: fbar fooKEYWORDbar -> fbar foobAr

¹ In this example I used KEYWORD for demonstration purposes. A single character, or at least a shorter character sequence, would be better/faster. Just make sure to pick something that cannot possibly ever be in the input.
² The remaining occurrences are those where the lowercase versions of the letters were not identical, so we have to revert them to their original state.
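
Footnote ¹ suggests a shorter marker. A minimal sketch of that variant, assuming GNU sed (needed for \l and \u, as in the command above) and assuming the control character \001 never occurs in the input. The function name remove_pairs_marker is made up for illustration:

```shell
# Hypothetical helper: the same three stages, with a one-byte marker
# instead of KEYWORD. Assumes GNU sed and that byte \001 never appears
# in the input.
remove_pairs_marker() {
    m=$(printf '\001')                      # the marker character
    sed -E \
        -e "s/([a-z])([A-Z])/$m\1\l\2/g" \
        -e "s/$m(.)\1//g" \
        -e "s/$m(.)(.)/\1\u\2/g"
}

echo 'foObar foobAr' | remove_pairs_marker
```

With GNU sed this should print fbar foobAr, the same result as the KEYWORD version. It is still three passes over each line, so it may not address the speed concern by itself.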

Upvotes: 1

potong

Reputation: 58440

This might work for you (GNU sed):

sed -r 's/aA|bB|cC|dD|eE|fF|gG|hH|iI|jJ|kK|lL|mM|nN|oO|pP|qQ|rR|sS|tT|uU|vV|wW|xX|yY|zZ//g' file

A programmatic solution:

sed 's/[[:lower:]][[:upper:]]/\n&/g;s/\n\(.\)\1//ig;s/\n//g' file

This marks every pair consisting of a lower-case character followed by an upper-case character by inserting a newline in front of it. The second command then removes each marker together with its pair when the two characters match via a case-insensitive back reference. The third command removes any remaining newlines, leaving pairs of different letters untouched.
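
The long alternation in the first command does not have to be typed by hand. A minimal sketch that generates it, assuming ASCII letters only and a sed that accepts -E. The function names build_pattern and remove_pairs_alt are made up for illustration:

```shell
# Hypothetical helpers: build "aA|bB|...|zZ" with awk, then hand it to sed.
build_pattern() {
    awk 'BEGIN {
        for (i = 97; i <= 122; i++) {       # ASCII codes for a..z
            c = sprintf("%c", i)
            printf "%s%s%s", (i > 97 ? "|" : ""), c, toupper(c)
        }
    }'
}

remove_pairs_alt() {
    sed -E "s/$(build_pattern)//g"
}

echo 'foObar foobAr' | remove_pairs_alt
```

This keeps the single-pass behaviour of the first command and only changes how the pattern is produced.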

Upvotes: 3

jthill

Reputation: 60303

There's an easy lex for this,

%option main 8bit
    #include <ctype.h>
%%
[[:lower:]][[:upper:]] if ( toupper(yytext[0]) != yytext[1] ) ECHO;

(There is a tab before the #include; markdown loses those.) Put that in a file, e.g. that.l, and then run make that. Simple lex programs like this are a nice addition to your toolkit.

Upvotes: 1

anubhava

Reputation: 785286

Here is a verbose awk solution as OP doesn't have perl or python available:

echo "foObar" |
awk -v ORS= -v FS='' '{
   for (i=2; i<=NF; i++) {
      if ($(i-1) == tolower($i) && $i ~ /[A-Z]/ && $(i-1) ~ /[a-z]/) {
         i++
         continue
      }
      print $(i-1)
   }
   print $(i-1)
}'

fbar
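
The question also allows pure shell. A minimal sketch of the same left-to-right walk using only POSIX parameter expansion and a case statement, with no external commands. The function name remove_pairs is an assumption and the 26 pairs are spelled out by hand:

```shell
# Hypothetical helper: walk the string, dropping "aA".."zZ" pairs.
# Uses the global variables rest, out and pair (POSIX sh has no local).
remove_pairs() {
    rest=$1
    out=
    while [ -n "$rest" ]; do
        # First two characters of $rest (empty if only one is left).
        pair=${rest%"${rest#??}"}
        case $pair in
            aA|bB|cC|dD|eE|fF|gG|hH|iI|jJ|kK|lL|mM|nN|oO|pP|qQ|rR|sS|tT|uU|vV|wW|xX|yY|zZ)
                rest=${rest#??}                 # skip the matching pair
                ;;
            *)
                out=$out${rest%"${rest#?}"}     # keep the first character
                rest=${rest#?}
                ;;
        esac
    done
    printf '%s\n' "$out"
}

remove_pairs 'foObar'
```

Like the awk loop, this makes one pass and does not re-examine characters that become adjacent after a deletion. A shell loop of this kind is likely to be slow on large inputs.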

Upvotes: 2

Wiktor Stribiżew

Reputation: 626936

You may use the following perl solution:

echo "foObar" | perl -pe 's/([a-z])(?!\1)(?i:\1)//g'

See the online demo.

Details

([a-z]) - Group 1: a lowercase ASCII letter
(?!\1) - a negative lookahead that fails the match if the next char is the same as captured with Group 1
(?i:\1) - the same char as captured with Group 1, matched case-insensitively; because the preceding lookahead rules out an identical char, it can only match the uppercase version.

The -e option allows you to define Perl code to be executed by the compiler and the -p option always prints the contents of $_ each time around the loop. See more here.

Upvotes: 3
