Reputation: 13
I need to remove the hyphen '-' character only when it matches the pattern 'space-[A-Z]' or '[A-Z]-space'. (Assuming all letters are uppercase, and space could be a space, or newline)
sample.txt
I AM EMPTY-HANDED AND I- WA-
-ANT SOME COO- COOKIES
I want the output to be
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
I've looked around for answers using sed, awk, and perl, but I could only find ones about removing all characters between two patterns or specific strings, not a specific character between [A-Z] and a space.
Thanks heaps!!
Upvotes: 0
Views: 242
Reputation: 1517
awk '{gsub(/ -/," ");gsub(/^-|-$/,"");gsub(/- /," ")}1' file
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
Upvotes: 0
Reputation: 84561
If you can provide Extended Regular Expressions to sed (generally with the -E or -r option), then you can shorten your sed expression to:
sed -E 's/(^|\s)-(\w)/\1\2/g;s/(\w)-(\s|$)/\1\2/g' file
Where the basic form is sed -E 's/find1/replace1/g;s/find2/replace2/g' file, which can also be written as separate expressions: sed -E -e 's/find1/replace1/g' -e 's/find2/replace2/g' (your choice).
The details of s/find1/replace1/g are:
find1 is
(^|\s) locate and capture the beginning of the line or a whitespace character,
'-' hyphen,
(\w) capture a word-character; and
replace1 is simply \1\2, which reinserts both captures with the first two backreferences.
The next substitution expression is similar, except now you are looking for the hyphen followed by whitespace or the end of the line. So you have:
find2 being
(\w) capture a word-character,
'-' hyphen,
(\s|$) capture a whitespace character or the end of the line; then
replace2 is the same as before, just reinsert the captured characters using backreferences.
In each case the g indicates a global replace of all occurrences.
(Note: the \w word-character class also includes the '_' (underscore), so while it is unlikely you would have a hyphen and underscore together, if you do, you need to use the [A-Za-z] list instead of \w.)
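As a minimal sketch of that note, the same two substitutions can be written with an explicit [A-Z] list in place of \w, and with the POSIX class [[:space:]] in place of \s (which is a GNU extension). The function name strip_edge_hyphens is hypothetical and only there for illustration; the input is the sample from the question.

```shell
# Hypothetical helper (not from the answer above): remove a hyphen only when
# it sits between line-start/whitespace and an uppercase letter, or between
# an uppercase letter and whitespace/line-end. Underscores are not treated
# as letters here, unlike with \w.
strip_edge_hyphens() {
    sed -E 's/(^|[[:space:]])-([A-Z])/\1\2/g; s/([A-Z])-([[:space:]]|$)/\1\2/g'
}

# Run it on the two sample lines from the question.
printf '%s\n' 'I AM EMPTY-HANDED AND I- WA-' '-ANT SOME COO- COOKIES' | strip_edge_hyphens
```

With this variant a sequence such as FOO_- BAR should be left untouched, whereas the \w form would drop that hyphen.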
Example Use/Output
In your case, the output is:
$ sed -E 's/(^|\s)-(\w)/\1\2/g;s/(\w)-(\s|$)/\1\2/g' file
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
Upvotes: 2
Reputation: 22012
If perl is an option, would you try the following:
perl -pe 's/(^|(?<=\s))-(?=[A-Z])//g; s/(?<=[A-Z])-((?=\s)|$)//g' sample.txt
(?<=\s) is a zero-width lookbehind assertion which matches leading whitespace without including it in the matched substring.
(?=[A-Z]) is a zero-width lookahead assertion which matches a trailing character between A and Z without including it in the matched substring.
The second s/..//g is the flipped version of the first one.
Upvotes: 3
Reputation: 141020
remove the hyphen '-' character only when it matches the pattern 'space-[A-Z]' or '[A-Z]-space'. Assuming all letters are uppercase, and space could be a space, or newline
It's:
sed 's/\( \|^\)-\([A-Z]\)/\1\2/g; s/\([A-Z]\)-\( \|$\)/\1\2/g'
s - substitute
/
\( \|^\) - space or beginning of the line
- - hyphen...
\([A-Z]\) - a single upper case character
/
\1\2 - The \1 is replaced by the first \(...\) thing. So it is replaced by a space or nothing. \2 is replaced by the single upper case character found. Effectively - is removed.
/
g - apply the regex globally
; - separate two s commands
s - the second substitution, which mirrors the first
$ - means end of the line.
Upvotes: 1
Reputation: 133518
Could you please try the following.
awk '{for(i=1;i<=NF;i++){if($i ~ /^-[a-zA-Z]+$|^[a-zA-Z]+-$/){sub(/-/,"",$i)}}} 1' Input_file
Here is a non-one-liner form of the solution:
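awk '
{
  # Check every whitespace-separated field on the line.
  for(i=1;i<=NF;i++){
    # Field is letters only, with one leading or one trailing hyphen.
    if($i ~ /^-[a-zA-Z]+$|^[a-zA-Z]+-$/){
      sub(/-/,"",$i)   # remove that hyphen
    }
  }
}
1   # print the (possibly modified) line
' Input_file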
Output will be as follows.
I AM EMPTY-HANDED AND I WA
ANT SOME COO COOKIES
Upvotes: 2