Reputation: 19
Find/replace space with %20
I must replace all spaces in *.html files which are inside href="something something .pdf" with %20.
I found a regular expression for that task:
find : href\s*=\s*['"][^'" ]*\K\h|(?!^)\G[^'" ]*\K\h
replace : %20
That regular expression works in text editors like Notepad++ or Geany. I want to use it from the Linux command line with sed or perl.
Solution (1):
cat test002.html | perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;' > Work_OK01.html
Solution (2):
cat test002.html | perl -ne 's/href\s*=\s*[\x27"][^\x27" ]*\K\h|(?!^)\G[^\x27" ]*\K\h/%20/g; print;' > Work_OK02.html
Upvotes: 1
Views: 683
Reputation: 386331
The problem is that you don't properly escape the single quotes in your program.
If your program is
...[^'"]...
The shell literal might be
'...[^'\''"]...'
'...[^'"'"'"]...'
'...[^\x27"]...' # Avoids using a single quote to avoid escaping it.
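A quick way to check literals like these is to let the shell print them back. This is a minimal sketch, assuming a POSIX shell with printf; the variable names a, b and c are only illustrative.

```shell
# Three shell spellings of the Perl fragment [^'"].
a='[^'\''"]'     # close the quote, add an escaped ', reopen the quote
b='[^'"'"'"]'    # close the quote, add "'" in double quotes, reopen the quote
c='[^\x27"]'     # no single quote at all; Perl itself decodes \x27

printf '%s\n' "$a" "$b" "$c"
# prints:
# [^'"]
# [^'"]
# [^\x27"]
```

The first two produce the same bytes for Perl; the third reaches Perl as \x27, which the regex engine reads as a single quote.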
So, you were going for
perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;'
Don't try to do everything at once. Here are some far cleaner (i.e. far more readable) solutions:
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ $1 =~ s/ /%20/rg }eg' # 5.14+
perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ (my $s = $1) =~ s/ /%20/g; $s }eg'
Note that -p is the same as -n, except that it causes a print to be performed for each line.
The above solutions make a large number of assumptions about the files that might be encountered[1]. All of these assumptions would go away if you use a proper parser.
If you have HTML files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_html_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//@href});
binmode(STDOUT);
print($doc->toStringHTML());
' in_file.html >out_file.html
If you have XML (incl XHTML) files:
perl -MXML::LibXML -e'
my $doc = XML::LibXML->new->parse_file($ARGV[0]);
$_->setValue( $_->getValue() =~ s/ /%20/gr )
for $doc->findnodes(q{//@href});
binmode(STDOUT);
$doc->toFH(\*STDOUT);
' in_file.html >out_file.html
Assumptions made by the substitution-based solutions:
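- No newline between href and =.
- No newline between = and the value.
- No newline in the value of href attributes.
- Nothing matching /href\s*=/ in text (incl CDATA sections).
- Nothing matching /href\s*=/ in comments.
- No other attribute has a name ending in href.
- No single quote (') in href="...".
- No double quote (") in href='...'.
- No href= with an unquoted value.
- Spaces in href attributes aren't encoded using a character entity (e.g. &#x20;).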
(SLePort makes similar assumptions even though they didn't document them. They also assume href attributes don't contain >.)
Upvotes: 3
Reputation: 131640
You seem to have neglected to escape the quotes inside the string you pass to Perl. So Bash sees you giving perl the following arguments:
- s/href\s*=\s*[][^' which results from the concatenation of the single-quoted string 's/href\s*=\s*[' and the double-quoted string "][^'"
- ]*Kh which is unquoted; \K and \h are not special characters in the shell, so it just treats them as K and h respectively
Then Bash sees a pipe character |, followed by a subshell invocation (?!^), in which !^ gets substituted with the first argument of the last command invoked. (See "History Expansion > Word Designators" in the Bash man page.) For example, if your last command was echo myface, then (?!^) would look for a command named ?myface and run it in a subshell.
And finally, Bash gets to the sequence \G[^'" ]*\K\h/%20/g; print;' which is interpreted as the concatenation of G (from \G), [^, and the single-quoted string " ]*\K\h/%20/g; print; that follows. Bash has no idea what to do with the resulting G[^" ]*\K\h/%20/g; print; since it just finished parsing a subshell invocation and expects to see a semicolon, line break, or logical operator (or so on) before getting another arbitrary string.
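The word splitting described above can be observed by handing the badly quoted text to printf instead of perl. This is only a sketch: it uses just the part before the pipe character, so nothing is piped or run in a subshell, and the variable name split is illustrative.

```shell
set -f   # turn off globbing so the unquoted * is left alone

# printf repeats its format for every argument, one per line, wrapped in <>.
split=$(printf '<%s>\n' 's/href\s*=\s*['"][^'" ]*\K\h)
printf '%s\n' "$split"
# prints:
# <s/href\s*=\s*[][^'>
# <]*Kh>

set +f   # restore globbing
```

The two lines are the two arguments perl would have received before the shell reached the pipe.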
Solution: properly quote the expression you give to perl. You'll need to use a combination of single and double quotes to pull it off, e.g.
perl -ne 's/href\s*=\s*['"'\"][^'\" ]*"'\K\h|(?!^)\G[^'"'\" ]*"'\K\h/%20/g; print;'
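One way to check mixed quoting like this is to compare the resulting argument with the intended Perl code read from a quoted here-document, where no shell quoting applies. This is a sketch assuming a POSIX shell; arg and want are illustrative names.

```shell
# The argument exactly as quoted in the command above.
arg='s/href\s*=\s*['"'\"][^'\" ]*"'\K\h|(?!^)\G[^'"'\" ]*"'\K\h/%20/g; print;'

# The intended Perl code, taken verbatim from a quoted here-document.
want=$(cat <<'EOF'
s/href\s*=\s*['"][^'" ]*\K\h|(?!^)\G[^'" ]*\K\h/%20/g; print;
EOF
)

if [ "$arg" = "$want" ]; then echo same; else echo different; fi
# prints: same
```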
Upvotes: 0
Reputation: 15461
An XML parser (e.g. XMLStarlet, xmllint, ...) would be better suited to that job, but if you don't have newlines in your a tags, the sed command below should work.
Using the t command and backreferences, it loops and replaces every space up to the last " inside the a tags:
$ sed ':a;s/\(<a [^>]*href=[^"]*"[^ ]*\) \([^"]*">\)/\1%20\2/;ta' <<< '<a href="http://url with spaces">'
<a href="http://url%20with%20spaces">
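The same loop can also be written with the label, the substitution and the branch as separate -e expressions. Some non-GNU sed implementations may not accept a label followed by ; on one line, and <<< is not available in every shell. This is a sketch under those assumptions, with the same sample input.

```shell
# Label, substitution and branch passed as three -e expressions;
# the input arrives through a pipe instead of a here-string.
out=$(printf '%s\n' '<a href="http://url with spaces">' |
  sed -e ':a' -e 's/\(<a [^>]*href=[^"]*"[^ ]*\) \([^"]*">\)/\1%20\2/' -e 'ta')

printf '%s\n' "$out"
# prints: <a href="http://url%20with%20spaces">
```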
Upvotes: 1