Reputation: 109
I want to write a shell script that extracts all "a href" HTML tags from a given link and prints them to the console. The problem I am facing is removing the unneeded text between them. After some googling, I concluded that "sed" would be the best tool for the job, but I cannot figure out how to write the command correctly:
#!/bin/sh
wget -qO - $1 | grep -E "*<[Aa]([[:print:]])*( |'\n')[Hh][Rr][Ee][Ff]([[:print:]])*</a" | sed 's/<\/a>.*<a/<\/a>REPLACED\n<a/g'
What I am trying to do is replace EVERYTHING between the "</a>" closing tag and the next "<a" opening tag. (I don't know much about HTML, so there may be other tags that also open and close with "a", but that's a problem for later.) However, this command, and a few variations I have tried, only works sometimes.
I am new to shell scripting, so any suggestions are welcome. Maybe "sed" is not the right command for the job. Thanks in advance for any help.
Edit 1: from this:
<a href="http://www.canonical.com">Canonical</a></li></ul></li></ul></div></div> <script> $(function() { $(".nav-global .more > a").click(function(e){ $(this).closest(".more").toggleClass("open"); return false; }); $(document).click(function(){ $(".nav-global .more.open").removeClass("open"); }); }); </script></div>
<a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>
to this:
<a href="http://www.canonical.com">Canonical</a>REPLACED<a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>
Edit 2: It seems I did not explain clearly what I expect. For large-scale testing, I use the link https://askubuntu.com/questions/726076/whats-wrong-with-my-grep-command. What I am trying to achieve is output that contains ONLY the "a href" tags (or other HTML tags that start with "<a" and end with "</a>"), separated by "REPLACED" as shown in the previous edit.
Upvotes: 3
Views: 231
Reputation: 105
Output the result to stdout:
sed -z 's/\(<\/a>\).*\(<a\)/\1REPLACED\2/g' inputfile
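For a whole page with many links, a different route may be easier than one substitution: put each anchor on its own line first, keep only those lines, and then join them with REPLACED. The following is only a sketch under some assumptions: extract_links is a hypothetical helper name, each anchor is assumed to start and end on the same input line, and nested or malformed anchors are not handled. A real HTML parser would be more robust.

```sh
#!/bin/sh
# Sketch: read HTML on stdin, print all <a ... href ...>...</a> elements
# on a single line, separated by the word REPLACED.
# Assumption: every anchor starts and ends on the same input line.
extract_links() {
    awk '{
        gsub(/<[Aa][ \t]/, "\n&")   # line break before each opening <a
        gsub(/<\/[Aa]>/, "&\n")     # line break after each closing </a>
        print
    }' |
    # Keep only the lines that are a complete anchor with an href attribute.
    grep -i '^<a[[:space:]].*href.*</a>$' |
    # Join the remaining lines, putting REPLACED between them.
    awk 'NR > 1 { printf "REPLACED" }
         { printf "%s", $0 }
         END { if (NR) print "" }'
}

# Possible usage, with the URL passed as the first script argument:
#   wget -qO - "$1" | extract_links
```

Because every anchor is isolated before anything is removed, a greedy ".*" never gets the chance to swallow the links that sit between the first "</a>" and the last "<a".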
Upvotes: 0
Reputation: 133518
1st solution: With your shown samples, please try the following awk code. Written and tested in GNU awk.
awk -v RS="" -v FS='<\\/a>.*<a href=' '{print $1"</a>REPLACED<a href="$2}' Input_file
2nd solution: Using the RS variable and the sub function of awk. Written and tested in GNU awk.
awk -v RS="" '{sub(/<\/a>.*<a href=/,"</a>REPLACED<a href=")} 1' Input_file
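One caveat when applying this beyond the two-link sample: ".*" is greedy, so with three or more links in the same record, everything from the first "</a>" to the last "<a href=" is replaced, including the links in between. A small sketch that shows the effect, using a made-up three-link input (the greedy_demo name and the sample lines are only for illustration):

```sh
#!/bin/sh
# Hypothetical three-link input, made up for illustration.
greedy_demo() {
    printf '%s\n' \
        '<a href="1">one</a> text' \
        '<a href="2">two</a> more text' \
        '<a href="3">three</a>' |
    # RS="" reads the whole blank-line-free input as one record,
    # so the pattern can match across the line breaks.
    awk -v RS="" '{sub(/<\/a>.*<a href=/,"</a>REPLACED<a href=")} 1'
}

greedy_demo
```

With GNU awk this prints <a href="1">one</a>REPLACED<a href="3">three</a>, so the second link is lost. That is fine for the sample shown in the question but not for a full page.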
Upvotes: 4
Reputation: 11227
Using sed
$ sed -Ez 's~(<[^<]*)[^\n]*\n +~\1</a>REPLACED~' input_file
<a href="http://www.canonical.com">Canonical</a>REPLACED<a href="#" class="s-topbar--menu-btn js-left-sidebar-toggle" role="menuitem" aria-haspopup="true" aria-controls="left-sidebar" aria-expanded="false"><span></span></a>
Upvotes: 2