Garak

Reputation: 19

Regular expression usage in command line - replace space with %20 inside href

Find/replace space with %20

I must replace all spaces in *.html files which are inside href="something something .pdf" with %20. I found a regular expression for that task:

find    : href\s*=\s*['"][^'" ]*\K\h|(?!^)\G[^'" ]*\K\h
replace : %20

That regular expression works in text editors like Notepad++ or Geany. I want to use it from the Linux command line with sed or perl. Solution (1):

    cat test002.html | perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;' > Work_OK01.html

Solution (2):

    cat test002.html | perl -ne 's/href\s*=\s*[\x27"][^\x27" ]*\K\h|(?!^)\G[^\x27" ]*\K\h/%20/g; print;' > Work_OK02.html

Upvotes: 1

Views: 683

Answers (3)

ikegami

Reputation: 386331

The problem is that you don't properly escape the single quotes in your program.

If your program is

...[^'"]...

The shell literal might be

'...[^'\''"]...'

'...[^'"'"'"]...'

'...[^\x27"]...'    # Avoids using a single quote to avoid escaping it.
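
To see what Perl actually receives, the three literals can be compared directly in the shell. This is only a sketch; the variable names `a`, `b` and `c` are made up for illustration.

```shell
# Each assignment is a shell literal; the shell strips the quoting before Perl sees it.
a='[^'\''"]'     # close the quote, add an escaped single quote, reopen the quote
b='[^'"'"'"]'    # close the quote, add a double-quoted single quote, reopen the quote
c='[^\x27"]'     # no single quote at all; Perl's regex engine reads \x27 as '

printf '%s\n' "$a" "$b" "$c"
```

The first two lines print `[^'"]`. The third prints `[^\x27"]` unchanged, because the shell does nothing with `\x27` inside single quotes; it is Perl that later treats it as a single quote.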

So, you were going for

perl -ne 's/href\s*=\s*['\''"][^'\''" ]*\K\h|(?!^)\G[^'\''" ]*\K\h/%20/g; print;'

Don't try to do everything at once. Here are some far cleaner (i.e. far more readable) solutions:

perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ $1 =~ s/ /%20/rg }eg'                # 5.14+

perl -pe's{href\s*=\s*['\''"]\K([^'\''"]*)}{ (my $s = $1) =~ s/ /%20/g; $s }eg'

Note that -p is the same as -n, except that it causes a print to be performed for each line.


The above solutions make a large number of assumptions about the files that might be encountered[1]. All of these assumptions would go away if you use a proper parser.

If you have HTML files:

perl -MXML::LibXML -e'
   my $doc = XML::LibXML->new->parse_html_file($ARGV[0]);
   $_->setValue( $_->getValue() =~ s/ /%20/gr )
      for $doc->findnodes(q{//@href});
   binmode(STDOUT);
   print($doc->toStringHTML());
' in_file.html >out_file.html

If you have XML (incl XHTML) files:

perl -MXML::LibXML -e'
   my $doc = XML::LibXML->new->parse_file($ARGV[0]);
   $_->setValue( $_->getValue() =~ s/ /%20/gr )
      for $doc->findnodes(q{//@href});
   binmode(STDOUT);
   $doc->toFH(\*STDOUT);
' in_file.html >out_file.html

  1. Assumptions made by the substitution-based solutions:

    • File uses an ASCII-based encoding (e.g. UTF-8, iso-latin-1, but not UTF-16le).
    • No newline between href and =.
    • No newline between = and the value.
    • No newline in the value of href attributes.
    • Nothing matching /href\s*=/ in text (incl CDATA sections).
    • Nothing matching /href\s*=/ in comments.
    • No other attributes have a name ending in href.
    • No single quote (') in href="...".
    • No double quote (") in href='...'.
    • No href= with an unquoted value.
    • Spaces in href attributes aren't encoded using a character entity (e.g. &#x20;).
    • Maybe more?

    (SLePort makes similar assumptions even though they didn't document them. They also assume href attributes don't contain >.)

Upvotes: 3

David Z

Reputation: 131640

You seem to have neglected to escape the quotes inside the string you pass to Perl. So Bash sees you giving perl the following arguments:

  1. s/href\s*=\s*[][^', which results from the concatenation of the single-quoted string 's/href\s*=\s*[' and the double-quoted string "][^'"
  2. ]*Kh, unquoted, because \K and \h are not special characters in the shell so it just treats them as K and h respectively

Then Bash sees a pipe character |, followed by a subshell invocation (?!^), in which !^ gets substituted with the first argument of the last command invoked. (See "History Expansion > Word Designators" in the Bash man page.) For example, if your last command was echo myface, then (?!^) would look for a command named ?myface and run it in a subshell.

And finally, Bash gets to the sequence \G[^'" ]*\K\h/%20/g; print;', which is interpreted as the concatenation of G (from \G), [^, and the single-quoted string " ]*\K\h/%20/g; print;. Bash has no idea what to do with G[^" ]*\K\h/%20/g; print;, since it just finished parsing a subshell invocation and expects to see a semicolon, line break, or logical operator (or so on) before getting another arbitrary string.
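
The first two steps of that parse can be reproduced safely with plain assignments. This is a sketch only; the variable names `first` and `second` are made up.

```shell
# Adjacent quoted pieces are concatenated into a single word:
#   's/href\s*=\s*['   followed by   "][^'"
first='s/href\s*=\s*['"][^'"

# Outside quotes, a backslash only escapes the next character,
# so \K\h reaches the command as Kh.
second=\K\h

printf '%s\n' "$first" "$second"
```

This prints `s/href\s*=\s*[][^'` and then `Kh`, matching the pieces listed in items 1 and 2.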

Solution: properly quote the expression you give to perl. You'll need to use a combination of single and double quotes to pull it off, e.g.

perl -ne 's/href\s*=\s*['"'\"][^'\" ]*"'\K\h|(?!^)\G[^'"'\" ]*"'\K\h/%20/g; print;'

Upvotes: 0

SLePort

Reputation: 15461

An XML parser would be better suited to that job (e.g. XMLStarlet, xmllint, ...), but if you don't have newlines in your a tags, the sed command below should work.

Using the t command and backreferences, it loops and replaces every space up to the last " inside the a tags:

$ sed ':a;s/\(<a [^>]*href=[^"]*"[^ ]*\) \([^"]*">\)/\1%20\2/;ta' <<< '<a href="http://url with spaces">'
<a href="http://url%20with%20spaces">

Upvotes: 1
