Reputation: 43
Using the command echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d '[:punct:]'
prints the string "Jiro Inagaki Soul MediaBreeze".
However, I want a regular expression that removes all punctuation except the underscore and ampersand, i.e. the output should be "Jiro Inagaki & Soul Media_Breeze".
Following advice on character class subtraction from the sources listed at the bottom, I've tried replacing [:punct:] with the following:
[\p{P}\-[&_]]
[[:punct:]-[&_]]
(?![\&_])\p{P}
(?![\&_])[:punct:]
[[:punct:]-[&_]]
[[:punct:]&&[&_]]
[[:punct:]&&[^&_]]
... but I haven't gotten anything to work so far. Any help would be much appreciated!
Sources:
Upvotes: 4
Views: 1503
Reputation: 1624
Posting my comment as an answer as requested by @jared_mamrot.
You can manually type out the set of punctuation that you want to delete, leaving out & and _. I took my punctuation set from the GNU docs on [:punct:]:
‘[:punct:]’ Punctuation characters; in the ‘C’ locale and ASCII character encoding, this is ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
You can also look at the POSIX docs, which say that the character classes depend on the locale:
punct <exclamation-mark>;<quotation-mark>;<number-sign>;\
<dollar-sign>;<percent-sign>;<ampersand>;<apostrophe>;\
<left-parenthesis>;<right-parenthesis>;<asterisk>;\
<plus-sign>;<comma>;<hyphen>;<period>;<slash>;\
<colon>;<semicolon>;<less-than-sign>;<equals-sign>;\
<greater-than-sign>;<question-mark>;<commercial-at>;\
<left-square-bracket>;<backslash>;<right-square-bracket>;\
<circumflex>;<underscore>;<grave-accent>;<left-curly-bracket>;\
<vertical-line>;<right-curly-bracket>;<tilde>
$ echo 'abcd_!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'" | tr -d '!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'"
abcd_
The set of characters in the tr command should be straightforward except for backslash, \\, which has been escaped for tr, and single quote, "'", which is being concatenated as a string quoted in double quotes, since you can't escape a single quote within single quotes.
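If the set is needed more than once, it may be easier to keep it in a variable and wrap the call in a small function. The following is only a sketch: the names punct_to_delete and strip_punct are made up for illustration, and LC_ALL=C is an assumption added so that the behaviour does not depend on the current locale.

```shell
# Hypothetical names; the set is ASCII punctuation minus '&' and '_'.
# Inside single quotes, \\ reaches tr unchanged, and tr reads it as one backslash.
# The trailing "'" appends a literal single quote to the set.
punct_to_delete='!"#$%()*+,-./:;<=>?@][\\^`{}|~'"'"

strip_punct() {
    # Assumption: force the C locale so the set is treated as plain ASCII.
    LC_ALL=C tr -d "$punct_to_delete"
}

printf '%s\n' "Jiro. Inagaki' & Soul, Media_Breeze." | strip_punct
```

With the input from the question this should leave the ampersand and the underscore in place while dropping the periods, the comma and the apostrophe.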
I do prefer @jared_mamrot's complement solution where possible, though. It is much neater.
Upvotes: 2
Reputation: 26640
You can specify the punctuation marks you want removed, e.g.
>echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -d "[.,/\\-\=\+\{\[\]\}\!\@\#\$\%\^\*\'\\\(\)]"
Jiro Inagaki & Soul Media_Breeze
Or, alternatively,
>echo "Jiro. Inagaki' & Soul, Media_Breeze." | tr -dc '[:alnum:] &_'
Jiro Inagaki & Soul Media_Breeze
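One thing to watch with the complement form: -dc deletes every character that is not in the set, and that includes the newline, so multi-line input ends up joined into one line. A possible variant, sketched here with a hypothetical function name, adds \n to the set. LC_ALL=C is an assumption on my part; it limits [:alnum:] to ASCII letters and digits, so accented characters would be removed too.

```shell
# Hypothetical helper: keep letters, digits, space, '&', '_' and newlines;
# delete everything else.
keep_wanted_chars() {
    # tr itself interprets \n as a newline, so single quotes are fine here.
    # Assumption: the C locale, so only ASCII alphanumerics are kept.
    LC_ALL=C tr -dc '[:alnum:] &_\n'
}

printf '%s\n' "Jiro. Inagaki' & Soul, Media_Breeze." | keep_wanted_chars
```

Tabs and other whitespace are still deleted by this variant; replacing the literal space and \n with [:space:] would keep them as well.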
Upvotes: 4