EET FEK

Reputation: 43

How to remove punctuation from a string with exceptions using regex in bash

The command echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d '[:punct:]' prints the string "Jiro Inagaki Soul MediaBreeze".

However, I want to find a regular expression that will remove all punctuation except the underscore and ampersand i.e. I want "Jiro Inagaki & Soul Media_Breeze".

Following advice on character class subtraction from the sources listed at the bottom, I've tried replacing [:punct:] with the following:

... but I haven't gotten anything to work so far. Any help would be much appreciated!

Sources:

Upvotes: 4

Views: 1503

Answers (2)

dosentmatter

Reputation: 1624

Posting my comment as an answer as requested by @jared_mamrot.

You can manually type out the set of punctuation that you want to delete, leaving out the characters you want to keep (here _ and &). I took my punctuation set from the GNU docs on [:punct:]:

‘[:punct:]’ Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

You can also look at the POSIX docs, which say that the character classes depend on the locale:

punct    <exclamation-mark>;<quotation-mark>;<number-sign>;\
         <dollar-sign>;<percent-sign>;<ampersand>;<apostrophe>;\
         <left-parenthesis>;<right-parenthesis>;<asterisk>;\
         <plus-sign>;<comma>;<hyphen>;<period>;<slash>;\
         <colon>;<semicolon>;<less-than-sign>;<equals-sign>;\
         <greater-than-sign>;<question-mark>;<commercial-at>;\
         <left-square-bracket>;<backslash>;<right-square-bracket>;\
         <circumflex>;<underscore>;<grave-accent>;<left-curly-bracket>;\
         <vertical-line>;<right-curly-bracket>;<tilde>

Deleting that set, minus _ and &:

$ echo 'abcd_!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'" | tr -d '!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'"
abcd_

The set of characters passed to tr is mostly literal. The two exceptions are the backslash, written \\ because tr treats a backslash as an escape character, and the single quote, written "'" and concatenated onto the single-quoted string, since you can't escape a single quote within single quotes.
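
As a rough sketch, here is the same quoting technique applied to the string from the question. The variable name punct_to_delete and the helper strip_punct are only illustrative; the set itself is the one used above, so & and _ are left alone.

```shell
#!/usr/bin/env bash
# Punctuation to delete: the C-locale [:punct:] set minus '&' and '_'.
# The backslash is doubled for tr; the single quote is appended as "'".
punct_to_delete='!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'"

# Hypothetical helper: delete every character of the set from stdin.
strip_punct() {
  tr -d "$punct_to_delete"
}

result=$(echo "Jiro. Inagaki' & Soul, Media_Breeze." | strip_punct)
echo "$result"   # prints: Jiro Inagaki & Soul Media_Breeze
```

Keeping the set in a variable means the awkward quoting only has to be written once.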

I do prefer @jared_mamrot's complement solution, if possible, though. It is much neater.

Upvotes: 2

jared_mamrot

Reputation: 26640

You can specify the punctuation marks you want removed, e.g.

>echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d "[.,/\\-\=\+\{\[\]\}\!\@\#\$\%\^\*\'\\\(\)]"
Jiro Inagaki & Soul Media_Breeze

Or, alternatively,

>echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -dc '[:alnum:] &_'
Jiro Inagaki & Soul Media_Breeze
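
One caveat with the complement form: -dc deletes everything that is not in the set, including the trailing newline that echo emits, so the output ends without a line break. A minimal variation, assuming you want to keep line breaks (for example when filtering a multi-line file), adds \n to the set. The function name keep_allowed is only illustrative.

```shell
#!/usr/bin/env bash
# Keep letters, digits, space, '&', '_' and newline; delete everything else.
# tr itself interprets the \n escape inside the single-quoted set.
keep_allowed() {
  tr -dc '[:alnum:] &_\n'
}

printf '%s\n' "Jiro. Inagaki' & Soul, Media_Breeze." "Foo-Bar, Baz_Qux" | keep_allowed
# prints:
# Jiro Inagaki & Soul Media_Breeze
# FooBar Baz_Qux
```

Note that, unlike deleting [:punct:], this whitelist approach also drops tabs and any other character outside the set.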

Upvotes: 4
