Reputation: 93
I have a simple regular expression that creates a group match for any semicolon contained within double quotes. I'm trying to use sed on Mac OS X to replace the semicolon with 'SEMICOLON'.
However, it's not working.
Here's the command I used:
sed -i.bu "s|.*?(;).*?|SEMICOLON|g" output/html/index.html
The result is that nothing is matched and nothing is replaced.
Desired behavior:
Input
"The man sat; the man cried;" cats; dogs;
Output
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
UPDATE:
Thanks for your help everyone. So my example wasn't very good. In reality, I process a JavaScript file that's been condensed to one line, and make sure each JavaScript statement has its own line. The problem is that the JavaScript is mostly translated text, so trying to make a simple regex that would insert a newline after each ; was difficult, because I obviously don't want a newline added if the semicolon is in quotes.
Long story short... I realized I was trying to reinvent the wheel, and decided to use js-beautify to pretty print the file. It's doing a little more than I need... but it's the best solution for now.
Thanks again!
Upvotes: 1
Views: 6175
Reputation: 113834
Let's take this as a test file:
$ cat file
"The man sat; the man cried;" cats; dogs;
1; 2; "man;"; 3; ";dog";
Try this sed command:
$ sed -E ':a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
1; 2; "manSEMICOLON"; 3; "SEMICOLONdog";
How it works:
:a
This creates a label a that we can refer to later.
s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/
This replaces the last ; that is inside double-quotes with SEMICOLON. Let's look at ^(([^"]*"[^"]*")*[^"]*"[^"]*); in more detail:
^
matches at the beginning of a string.
([^"]*"[^"]*")*
matches from the beginning of the line through any number of complete quoted strings.
Because, in sed, regular expressions are greedy (more precisely, leftmost-longest), this will try to match as many complete quoted strings as it can.
[^"]*"[^"]*;
matches any non-quotes that follow the complete quoted strings (as above), followed by the next quote character, followed by any number of non-quote characters, followed by ;.
Since the above regex minus the final ; is itself inside parens, it is saved as group 1. We replace the matched text with group 1 followed by SEMICOLON.
ta
If the last command resulted in a substitution (in other words, we found a ; that needed to be replaced), then jump back to label a and repeat.
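One caveat for the original poster's platform: some sed implementations, the BSD-derived sed on Mac OS X among them, have traditionally read everything after : up to the end of the line as the label name, so ':a; s/...' on a single line can fail there. A minimal sketch of a more portable spelling, assuming a sed that accepts -E, passes each piece as its own -e expression. The temporary file from mktemp is only there to make the example self-contained:

```shell
# Recreate the two-line test file in a temporary location.
tmp=$(mktemp)
printf '%s\n' '"The man sat; the man cried;" cats; dogs;' \
              '1; 2; "man;"; 3; ";dog";' > "$tmp"

# Same loop as above, but each -e acts like a separate script line,
# so the label name ends at "a" for both ":a" and "ta".
result=$(sed -E \
  -e ':a' \
  -e 's/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/' \
  -e 'ta' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

The output should be the same two lines shown for the one-line version of the command.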
Let's consider:
sed "s|.*?(;).*?|SEMICOLON|g"
In Python and elsewhere, .*? is a non-greedy match. Sed, however, has no such concept. For that matter, by default, sed uses Basic Regular Expressions (BRE) in which ? just means a literal question mark.
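As a quick illustration of the difference (a small sketch with made-up inputs, not taken from the question), the same pattern a?b behaves differently with and without -E:

```shell
# BRE (default): ? is an ordinary character, so only a literal "a?b" matches.
bre_literal=$(printf 'a?b\n' | sed 's/a?b/X/')
bre_plain=$(printf 'ab\n' | sed 's/a?b/X/')

# ERE (-E): ? means "zero or one of the preceding item", so "ab" matches.
ere_plain=$(printf 'ab\n' | sed -E 's/a?b/X/')

printf '%s\n' "$bre_literal" "$bre_plain" "$ere_plain"
```

This prints X, ab, and X: in BRE the input ab is left alone because it contains no question mark.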
Also, it is asking for trouble to put sed commands in double-quotes, as this invites the shell to modify them.
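A small sketch of what that modification looks like, using printf to show the script text sed would receive. The variable x is hypothetical; it stands in for anything that happens to be set in your shell:

```shell
x=oops   # hypothetical variable that happens to be set in the shell

# Double quotes: the shell expands $x before sed ever sees the script.
double=$(printf '%s\n' "s/a$x/b/")

# Single quotes: the script reaches sed exactly as typed.
single=$(printf '%s\n' 's/a$x/b/')

printf '%s\n' "$double" "$single"
```

The first line printed is s/aoops/b/ and the second is s/a$x/b/, so only the single-quoted form preserves the intended regex.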
So, since BRE syntax is the older and more limited of the two, let's (1) switch to Extended Regular Expressions (ERE) using the -E switch, (2) put the command in single-quotes, and (3) change .*? to .*:
$ sed -E 's|.*(;).*|SEMICOLON|g' file
SEMICOLON
(Compatibility note: if you are on a very old Linux system with an old GNU sed, you may need to replace -E with -r.)
.*(;).*
matches everything up to the last semicolon on the line, followed by the semicolon, followed by whatever follows the last semicolon. In other words, if the line contains a semicolon, .*(;).* matches the whole line. That is why the output is just SEMICOLON.
Also, (;) matches a semicolon and saves it in group 1. Since we never use group 1 anywhere, this does nothing for us. We would get the same result with:
$ sed -E 's|.*;.*|SEMICOLON|g' file
SEMICOLON
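To see how the greedy .* divides up the line, here is a small sketch that captures all three parts and prints each one in square brackets. The bracket formatting is just for illustration:

```shell
line='"The man sat; the man cried;" cats; dogs;'

# Wrap each of the three parts in brackets to show what it matched.
parts=$(printf '%s\n' "$line" | sed -E 's|(.*)(;)(.*)|[\1][\2][\3]|')

printf '%s\n' "$parts"
```

This prints ["The man sat; the man cried;" cats; dogs][;][]: the first .* swallows everything up to the last semicolon and the trailing .* matches the empty string.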
If we remove the .*, then every ; will be replaced:
$ sed -E 's|;|SEMICOLON|g' file
"The man satSEMICOLON the man criedSEMICOLON" catsSEMICOLON dogsSEMICOLON
If we want to replace the last ; in the first quoted string, we could use:
$ sed -E 's|^([^"]*"[^"]*);|\1SEMICOLON|g' file
"The man sat; the man criedSEMICOLON" cats; dogs;
If we want to replace all ; that are within any quoted string on the line, then we are back to the command at the top.
Let's consider a test file with a string spanning 2 lines:
$ cat file2
"man;" cat "dog
;"; ";man";
If you have GNU sed:
$ sed -Ez ':a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file2
"manSEMICOLON" cat "dog
SEMICOLON"; "SEMICOLONman";
In general for any POSIX sed:
$ sed -E 'H;1h;$!d;x; :a; s/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/; ta' file2
"manSEMICOLON" cat "dog
SEMICOLON"; "SEMICOLONman";
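The H;1h;$!d;x prefix is what gathers the whole file into the pattern space before the loop runs. A commented sketch of the same command, split into separate -e expressions (an assumption made here for readability and so that the label name ends cleanly), might look like this:

```shell
# Recreate the two-line test file with a quoted string spanning both lines.
tmp=$(mktemp)
printf '%s\n' '"man;" cat "dog' ';"; ";man";' > "$tmp"

# H    append the current line to the hold space (after a newline)
# 1h   on line 1, overwrite the hold space instead, avoiding a leading newline
# $!d  on every line but the last, discard the pattern space and start the next cycle
# x    on the last line, swap the accumulated text into the pattern space
result=$(sed -E \
  -e 'H;1h;$!d;x' \
  -e ':a' \
  -e 's/^(([^"]*"[^"]*")*[^"]*"[^"]*);/\1SEMICOLON/' \
  -e 'ta' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because the substitution only runs once the last line has been read, [^"]* is free to match across the embedded newline.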
Upvotes: 4
Reputation: 203324
sed is for simple s/old/new, that is all. With any awk:
$ awk 'match($0,/"[^"]+"/) {
    str = substr($0,RSTART,RLENGTH)
    gsub(/;/,"SEMICOLON",str)
    $0 = substr($0,1,RSTART-1) str substr($0,RSTART+RLENGTH)
} 1' file
"The man satSEMICOLON the man criedSEMICOLON" cats; dogs;
That's assuming you actually want all semicolons in the quoted string treated the same way. If not, whatever it is you want to do is an easy tweak, e.g. if you want that last semicolon after cried removed instead of replaced as shown in your sample output:
$ awk 'match($0,/"[^"]+"/) {
    str = substr($0,RSTART+1,RLENGTH-2)
    sub(/;$/,"",str)
    gsub(/;/,"SEMICOLON",str)
    $0 = substr($0,1,RSTART) str substr($0,RSTART+RLENGTH-1)
} 1' file
"The man satSEMICOLON the man cried" cats; dogs;
Upvotes: 1
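Note that match() only finds the first quoted string on each line. If a line can hold several quoted strings, one possible variation (a sketch, assuming quotes are balanced and never escaped) is to split on the double quote itself, so that every even-numbered field is the inside of a quoted string:

```shell
# Recreate a two-line test file with more than one quoted string per line.
tmp=$(mktemp)
printf '%s\n' '"The man sat; the man cried;" cats; dogs;' \
              '1; 2; "man;"; 3; ";dog";' > "$tmp"

# With " as the field separator, even-numbered fields are the text
# inside quotes (assuming balanced, unescaped quotes on each line).
result=$(awk 'BEGIN { FS = OFS = "\"" }
{
    for (i = 2; i <= NF; i += 2)
        gsub(/;/, "SEMICOLON", $i)
} 1' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Setting OFS to the same character means awk puts the quotes back when it rebuilds the line after a field is modified.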