Reputation: 109

Replacing everything between two strings in UNIX shell

I want to write a shell script that fetches all "a href" HTML tags from a given link and prints them to the console. The problem I am facing is removing all of the text between them that I don't need. After some googling I concluded that the "sed" command would be best for this job, but I cannot figure out how to write it correctly.

#!/bin/sh
wget -qO - $1 | grep -E "*<[Aa]([[:print:]])*( |'\n')[Hh][Rr][Ee][Ff]([[:print:]])*</a" | sed 's/<\/a>.*<a/<\/a>REPLACED\n<a/g'

What I am trying to do is replace EVERYTHING between the "</a>" closing tag and the next "<a" opening tag (I don't know much about HTML; there may be other tags that open with "<a" and close with "</a>", but that's a problem for later). However, this (and a few variants I have tried) only works sometimes.

I am new to shell scripting, so any suggestions are welcome. Maybe "sed" is not the right command for the job. Thanks in advance.

Edit 1: from this:

<a href="http://www.canonical.com">Canonical</a></li></ul></li></ul></div></div> <script> $(function() { $(".nav-global .more > a").click(function(e){ $(this).closest(".more").toggleClass("open"); return false; }); $(document).click(function(){ $(".nav-global .more.open").removeClass("open"); }); }); </script></div>
            <a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>

to this:

<a href="http://www.canonical.com">Canonical</a>REPLACED<a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>

Edit 2: It seems I did not explain clearly what I expect. For large-scale testing, I use the link https://askubuntu.com/questions/726076/whats-wrong-with-my-grep-command. What I am trying to achieve is output containing ONLY the "a href" tags (or other HTML elements that start with "<a" and end with "</a>"), separated by "REPLACED" as shown in the previous edit.

Upvotes: 3

Answers (3)

wangloo

Reputation: 105

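This prints the result to stdout. The -z option (GNU sed) makes sed read NUL-separated input, so the whole file is treated as a single line and the pattern can match across newlines: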

sed -z 's/\(<\/a>\).*\(<a\)/\1REPLACED\2/g' inputfile
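One caveat worth checking before running this on a full page: `.*` is greedy, so the match runs from the first "</a>" to the LAST "<a" in the input, and any links in between are swallowed. The sketch below shows this on a made-up one-line sample with three links (the sample text and the variable names `sample` and `greedy` are assumptions for illustration, not taken from the question).

```shell
# Made-up one-line sample with three links (not from the question).
sample='<a href="u1">One</a> x <a href="u2">Two</a> y <a href="u3">Three</a>'

# Same expression as above; -z is omitted because the sample is a single line.
greedy=$(printf '%s\n' "$sample" | sed 's/\(<\/a>\).*\(<a\)/\1REPLACED\2/g')

# Only the first and last links survive; the middle one is consumed by ".*".
printf '%s\n' "$greedy"
```

This may be why the command in the question "only works sometimes": it gives the expected result when the input holds exactly two links, as in the Edit 1 sample, and drops links when there are more.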

Upvotes: 0

RavinderSingh13

Reputation: 133518

1st solution: With your shown samples, please try the following awk code. Written and tested in GNU awk.

awk -v RS="" -v FS='<\\/a>.*<a href=' '{print $1"</a>REPLACED<a href="$2}' Input_file

2nd solution: Using RS and the sub function of awk. Written and tested in GNU awk.

awk -v RS="" '{sub(/<\/a>.*<a href=/,"</a>REPLACED<a href=")} 1' Input_file
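Both commands above target the two-link sample from Edit 1. For a whole page with many links (as in Edit 2), one possible variation is to loop over the input and copy out each "<a ... </a>" span, since awk regular expressions have no non-greedy matching. The following is only a sketch under some assumptions: the function name `extract_anchors` is hypothetical, tags are matched case-sensitively as the literal strings "<a " and "</a>", nested or malformed markup is not handled, and the whole page is held in memory as one line.

```shell
# Hypothetical helper: reads HTML on stdin, prints only the <a ...>...</a>
# spans, joined by the marker REPLACED.
extract_anchors() {
  # Join all lines so an anchor that spans several lines can still be found.
  tr '\n' ' ' | awk '{
    out = ""
    # Find the next opening tag, then the first closing tag after it.
    while ((s = index($0, "<a ")) > 0) {
      $0 = substr($0, s)            # drop everything before "<a "
      e = index($0, "</a>")
      if (e == 0) break             # unterminated anchor: stop
      out = out (out == "" ? "" : "REPLACED") substr($0, 1, e + 3)
      $0 = substr($0, e + 4)        # continue after this "</a>"
    }
    print out
  }'
}
```

With the script from the question it could be used as `wget -qO - "$1" | extract_anchors`. Using index() for plain string search, instead of a regex, avoids the greedy-match problem because the first "</a>" after each opening tag is always the one chosen.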

Upvotes: 4

sseLtaH

Reputation: 11227

Using GNU sed (the -z option is a GNU extension):

$ sed -Ez 's~(<[^<]*)[^\n]*\n +~\1</a>REPLACED~' input_file
<a href="http://www.canonical.com">Canonical</a>REPLACED<a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>

Upvotes: 2
