Reputation: 3584
I am using grep on Ubuntu 10.10 to process some HTML files.
Here is the HTML snippet:
<a href="video.php?video=one-hd.mov"><img src="/1.jpg"><a href="video.php?video=normal.mov"><img src="/2.jpg"><a href="video.php?video=another-hd.mov">
I would like to extract one-hd.mov and another-hd.mov but ignore normal.mov.
Here is my code:
example='<a href="video.php?video=one-hd.mov"><img src="/1.jpg"><a href="video.php?video=normal.mov"><img src="/2.jpg"><a href="video.php?video=another-hd.mov">'
echo $example | grep -Po '(?<=video.php\?video=).*?(?=-hd.mov">)'
The result is:
one
normal.mov"><img src="/2.jpg"><a href="video.php?video=another
But I want
one
another
The second line of the result is not what I want.
Is this because of so-called greedy regular expression matching?
I am using grep, but any command-line tool that works from bash (sed, etc.) is welcome.
Thanks a lot.
Upvotes: 1
Views: 1133
Reputation: 45670
Solution using awk:
{
    for (i = 1; i < NF; i++) {
        if ($i ~ /mov/) {
            if ($i !~ /normal/) {
                sub(/^.*=/, "", $i)
                print $i
            }
        }
    }
}
outputs:
$ awk -F'"' -f h.awk html
one-hd.mov
another-hd.mov
But I strongly advise you to use an HTML parser for this instead, something like BeautifulSoup.
Upvotes: 1
Reputation: 40850
Here is a solution using xmlstarlet:
$ example='<a href="video.php?video=one-hd.mov"><img src="/1.jpg"><a href="video.php?video=normal.mov"><img src="/2.jpg"><a href="video.php?video=another-hd.mov">'
$ echo $example | xmlstarlet fo -R 2>/dev/null | xmlstarlet sel -t -m "//*[substring(@href, string-length(@href) - 6, 7) = '-hd.mov']" -v 'substring(@href,17, string-length(@href) - 17 - 3)' -n
one-hd
another-hd
$
Upvotes: 1
Reputation: 63972
You want to use Perl regexes with grep, so why not use Perl directly?
echo "$example" | perl -nle 'm/.*?video.php\?video=([^"]+)">.*video.php\?video=([^"]+)".*/; print "=$1=$2="'
will print
=one-hd.mov=another-hd.mov=
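The [^"]+ class used above is what keeps each capture inside a single href value, and the same idea seems to fix the original grep command. The lazy .*? in the question is not greedy, but it is still allowed to run across quotes: a match can begin right after video.php?video= in the normal.mov link and keep extending until the next -hd.mov" appears. Here is a minimal sketch, assuming GNU grep built with PCRE support (the -P option); the variable name result is only for illustration.

```shell
example='<a href="video.php?video=one-hd.mov"><img src="/1.jpg"><a href="video.php?video=normal.mov"><img src="/2.jpg"><a href="video.php?video=another-hd.mov">'

# [^"]* cannot cross a double quote, so a match can no longer start in one
# href value and end in a later one. The dots are escaped to match literally.
result=$(printf '%s\n' "$example" | grep -Po '(?<=video\.php\?video=)[^"]*(?=-hd\.mov")')

printf '%s\n' "$result"
```

With the sample input this should print one and another on separate lines; normal.mov is skipped because nothing inside its quotes is followed by -hd.mov".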
Upvotes: 2