Meister Duisburg

Reputation: 63

How to delete html meta tag using sed?

I've made a lot of index.html files with HTTrack. Now I want to delete the same 2 added meta tags from them with sed.

The added tag looks like this:

<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8"><!-- /Added by HTTrack -->

I think this is a solution to edit all index.html files recursively in the folder:

cd /home/user/websites
grep -lr -e 'index' *.html | xargs sed -i 's/<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8"><!-- /Added by HTTrack -->
//g'

It doesn't work. Please help me, thanks.

Upvotes: 1

Views: 1277

Answers (1)

Adam Zalcman

Reputation: 27233

Try this:

grep -lr -e 'index' *.html | xargs sed -i .bak -e 's#<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8"><!-- /Added by HTTrack -->##g'

However, this only works if the files you want to modify contain the string index, because grep -l lists only the files whose contents match. If you want to modify all index.html files under the current directory and its subdirectories, use this:

find . -name 'index.html' | xargs sed -i .bak -e 's#<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8"><!-- /Added by HTTrack -->##g'

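Either way, the important thing was to replace / with # as the separator in sed's s command. The pattern itself contains / characters (in text/html and in /Added), so with / as the separator sed treats the first of them as the end of the pattern. The s command lets you use almost any character as the separator as long as you're consistent (i.e. all three separators are the same character). Pick a separator that does not appear in your expression; otherwise you have to escape it there with a backslash.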
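To make the separator point concrete, here is a small sketch that runs both variants on a single sample line instead of real files. The sample line and the variable names (line, with_hash, with_slash) are made up for illustration; only the HTTrack block itself comes from the question.

```shell
# Sample line containing the block HTTrack inserts (block text taken from the question).
line='<head><!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8"><!-- /Added by HTTrack --><title>x</title></head>'

# Variant 1: '#' as the separator, so the slashes in the pattern need no escaping.
with_hash=$(printf '%s\n' "$line" | sed 's#<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8"><!-- /Added by HTTrack -->##g')

# Variant 2: '/' stays the separator, so every '/' inside the pattern is escaped as '\/'.
with_slash=$(printf '%s\n' "$line" | sed 's/<!-- Added by HTTrack --><meta http-equiv="content-type" content="text\/html;charset=UTF-8"><!-- \/Added by HTTrack -->//g')

# Both variants should print the line with the HTTrack block removed.
printf '%s\n%s\n' "$with_hash" "$with_slash"
```

Both forms do the same thing; the # version is just easier to read and harder to get wrong.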

Also note that I modified the flags passed to sed. From the sed man page (this wording is from the BSD version of sed):

 -i extension
         Edit files in-place, saving backups with the specified extension.  If a zero-length extension is given, no backup will be saved.  It is not recommended to give a zero-length extension when in-place editing files, as you risk corruption or partial content in situations where disk space is exhausted, etc.

 -e command
         Append the editing commands specified by the command argument to the list of commands.

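This means that my commands will save a backup of every file before modifying it, appending '.bak' to the backup file's name. If you don't want the backups, pass a zero-length extension like this: -i ''. Note that the form with a space (-i .bak, -i '') is the BSD sed syntax. GNU sed, the usual default on Linux, expects the suffix attached directly to the flag (-i.bak) and plain -i for no backup.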
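As a rough sketch of a variant that should behave the same with GNU and BSD sed, the suffix can be attached directly to -i, and find can call sed itself so that paths containing spaces are handled. The temporary directory, the "my site" folder and the shortened pattern are assumptions made up for this demonstration, not part of the original setup.

```shell
# Work in a throwaway directory so no real files are touched.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/my site"
printf '%s\n' 'before<!-- Added by HTTrack -->after' > "$tmpdir/my site/index.html"

# -i.bak (suffix attached, no space) is accepted by both GNU and BSD sed.
# find ... -exec ... {} + hands the file names to sed directly,
# so a space in a path does not split it the way plain xargs would.
find "$tmpdir" -name 'index.html' -exec sed -i.bak -e 's#<!-- Added by HTTrack -->##g' {} +

edited=$(cat "$tmpdir/my site/index.html")       # file after the in-place edit
backup=$(cat "$tmpdir/my site/index.html.bak")   # untouched copy saved by -i.bak
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"

# Clean up the throwaway directory.
rm -rf "$tmpdir"
```

Once you have checked the results, the .bak files can be removed with another find call.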

In general, regular expressions are not powerful enough to parse HTML. Here it works only because you have a fixed sequence of characters to replace, which just happens to be HTML.

Upvotes: 2
