Reputation:
I would like to replace the following input (in an HTML page):
<base href="" />
by <base href="http://mywebsite.com/image/" />
for different files.
Here's what I'm doing. For each file, I get the number of the line where the <base tag is located:
nb_ligne=$(grep -n '<base' $i | awk -F : '{print $1}')
Then I strip the leading directories (the $dir_root prefix) from the file's path:
path_dir=$(echo $i | sed 's/^$dir_root//g')
path_dir gives the suffix path (for example, it may be equal to /image/ in the command).
and finally:
sed -i "$nb_ligne s/\".*\"/\"http\:\/\/mywebsite.com$path_dir\"/g" $i
but this last command doesn't work ($i is the current filename), even though I have used double quotes so that the shell variables are expanded.
Upvotes: 1
Views: 659
Reputation: 189357
Sticking with sed, here is a single substitution which does what you appear to be doing.
sed -i "s%\(<base href=\)\"\"%\1\"http://mywebsite.com${i#$dir_root}\"%" "$i"
I removed the /g flag as you are unlikely to ever have more than one <base> tag in a document, let alone multiple on the same line.
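The ${i#$dir_root} expansion in that command strips the prefix inside the shell, without a separate sed call. A minimal sketch of what it yields, assuming made-up values for dir_root and i (both hypothetical); the last step, which drops the file name, is only an assumption about the intended href:

```shell
# Hypothetical values, for illustration only.
dir_root=/var/www
i=/var/www/image/index.html

# ${i#"$dir_root"} removes $dir_root from the start of $i; quoting the inner
# expansion makes the prefix match literally rather than as a glob pattern.
suffix=${i#"$dir_root"}

# Assumption: only the directory part is wanted in the href,
# so drop the file name with ${suffix%/*} and re-add the slash.
path_dir=${suffix%/*}/

printf '%s\n' "$suffix" "$path_dir"
```

Note that the plain ${i#$dir_root} form keeps the file name in the result, so it matches the /image/ example only if $i names a directory.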
Upvotes: 1
Reputation: 44023
Leaving aside the question of whether editing HTML with a line-based tool is a good idea, and assuming that you can guarantee that the format of the HTML file will never change:
gawk -i inplace -v dir="$path_dir" '/<base/ { sub(/".*"/, "\"http://mywebsite.com" dir "\""); } 1' "$i"
It is not a good idea to use sed for this task because you end up substituting variables into the sed code, which means that they'll be treated as code, and then you run into the usual code injection problems. If your path contains a &, for example, you will get strange results because & has special meaning for sed in the context where it is used, and that's among the least terrible things that can happen if someone else controls the path name (GNU sed can be made to execute arbitrary commands with s///e, which can be great fun).
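To make the & problem concrete, here is a small sketch with a hypothetical path containing an ampersand. It first interpolates the path as-is, then escapes it before interpolating. The escaping step is one possible mitigation, not something from the question, and it does not cope with newlines in the path:

```shell
# Hypothetical path containing an ampersand.
path_dir='/a&b/'
line='<base href="" />'

# Interpolated as-is: sed expands & to the matched text (the empty quotes).
broken=$(printf '%s\n' "$line" |
    sed "s%\".*\"%\"http://mywebsite.com$path_dir\"%")

# Backslash-escape &, the % delimiter and backslashes before interpolating.
escaped=$(printf '%s\n' "$path_dir" | sed 's/[\\&%]/\\&/g')
fixed=$(printf '%s\n' "$line" |
    sed "s%\".*\"%\"http://mywebsite.com$escaped\"%")

printf '%s\n' "$broken" "$fixed"
```

The first result has a stray pair of quotes where the & used to be; the second keeps the path intact.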
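Using awk instead sidesteps the code injection issue by treating $path_dir as data from the start (note, though, that & and backslashes remain special in the replacement string of sub, so such characters in the path would still need escaping). The awk code itself is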
/<base/ {   # in lines that contain "<base"
    # substitute this regex with this string. The regex and string
    # are taken from your sed command.
    sub(/".*"/, "\"http://mywebsite.com" dir "\"")
}
1   # afterwards, print all lines. (1 means true here, and printing
    # is the default action)
If you want the effect of s///g, use gsub instead of sub, but it does not make sense to me that you'd want to replace all instances of something enclosed in "" in case there's more than one on the matching line. It looks brittle enough as it is, to be honest. You might want to consider a stricter regex such as
sub(/href=".*"/, "href=\"http://mywebsite.com" dir "\"");
at least. Perhaps even /<base href=".*"/.
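A stricter variant along these lines could look like the sketch below. It is my own variation rather than the exact regex above: it matches href="[^"]*" so the match stops at the closing quote, splices the line with match() and substr() instead of sub() so that & in the URL is copied literally, and reads the value from ENVIRON. The helper name set_base and the variable BASE_URL are hypothetical.

```shell
# Hypothetical helper: print FILE ($1) with the href of its <base> tag
# set to URL ($2). The result goes to standard output.
set_base() {
    BASE_URL=$2 awk '
        # Only touch lines where a <base ...> tag with an href is found.
        /<base/ && match($0, /href="[^"]*"/) {
            # Splice the line by position; ENVIRON values are plain data,
            # so "&" and backslashes in the URL are copied literally.
            $0 = substr($0, 1, RSTART - 1) "href=\"" ENVIRON["BASE_URL"] "\"" substr($0, RSTART + RLENGTH)
        }
        1   # print every line
    ' "$1"
}
```

It could be called as set_base "$i" "http://mywebsite.com$path_dir" > "$i.new", with the output then moved over the original file.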
nb_ligne is not necessary for this task, so I left it out.
The only GNU-specific feature I use is -i inplace for in-place editing, so if you have mawk or a very old gawk, leave it out and use something like
cp "$i" "$i"~ && awk -v dir="$path_dir" '/<base/ { sub(/".*"/, "\"http://mywebsite.com" dir "\""); } 1' "$i"~ > "$i"
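If a backup file next to each original is not wanted, the same idea can be wrapped in a function that goes through a temporary file. This is only a sketch: the name rewrite_base is hypothetical, and it assumes mktemp is available (it is common but not required by POSIX).

```shell
# Hypothetical helper: rewrite the <base> line of FILE ($1) so that its
# quoted value becomes http://mywebsite.com followed by DIR ($2).
rewrite_base() {
    file=$1
    dir=$2
    tmp=$(mktemp) || return 1
    # Write the edited copy to the temp file first, and only overwrite the
    # original (keeping its permissions) if awk succeeded.
    awk -v dir="$dir" '/<base/ { sub(/".*"/, "\"http://mywebsite.com" dir "\"") } 1' "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Inside the loop over files it would be called as rewrite_base "$i" "$path_dir".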
Upvotes: 2