Reputation: 63445
I have an html page with the following content:
[...]
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>
[...]
And I'd like to extract just
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
to then find the latest version (in this case it would be play-1.0.2.1.zip)
So I tried with
cat tmp.html | grep "<a href=\".*\""
<a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="m"
<a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="m"
<a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="m"
So I tried with lazy:
cat tmp.html | grep "<a href=\".*?\""
and negating the quotes
cat tmp.html | grep "<a href=\"[^\"]*?\""
both of them returning nothing
I need to get only the matching part (not the href), and then to find the latest, but I'm stuck with this greediness problem...
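A likely reason both lazy attempts print nothing: without -E or -P, grep uses basic regular expressions, where an unescaped ? is a literal question mark rather than a quantifier. The pattern therefore only matches text that has a real ? right before the closing quote. A minimal sketch of that behaviour; the two sample strings are made up for illustration:

```shell
# Hypothetical sample lines: one normal link, one with a literal "?" before the quote.
sample='<a href="play-1.0.1.zip">play-1.0.1.zip</a>'
odd='<a href="play-1.0.1.zip?">odd</a>'

# Default (basic) regex: "?" is an ordinary character, so the normal link does not match.
no_match=$(printf '%s\n' "$sample" | grep -c '<a href=".*?"' || true)

# The same pattern does match once the text really contains ?" in that position.
literal_match=$(printf '%s\n' "$odd" | grep -c '<a href=".*?"' || true)

echo "normal link: $no_match matching line(s)"
echo "link with literal ?: $literal_match matching line(s)"
```

Lazy quantifiers such as .*? belong to Perl-style regexes, which is why the answers that use them also pass -P.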
--
Thanks a lot for all the answers. They were all pretty useful, and it's hard to decide which one is correct. In the end I solved it with:
grep -v '.*-RC.*' index.html | grep -oP 'play-1.*?.zip' | sort -Vru | head -1
Upvotes: 3
Views: 555
Reputation: 31
Didn't see cut (and I like it for its brevity & speed) so:
cut -d\" -f4 tmp.html | sort -Vu | tail -1
output:
play-1.0.2.1.zip
Upvotes: 3
Reputation: 278
Using the answer provided by Craig Andrews with the addition of OSX support.
grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' /test.html | sort -n -r -k1.10,12
Result:
play-1.0.2.1.zip
play-1.0.2.zip
play-1.0.1.zip
Upvotes: 1
Reputation: 91430
A perl way:
cat thefile | perl -anF'"' -e 'print $F[3],"\n";($v)=$F[3]=~/(\d.*\d)/;$m=$v if$v gt $m;}{print "max=$m\n";'
output:
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
max=1.0.2.1
Upvotes: 0
Reputation: 113
Awk is a great tool, if you know the field numbers:
awk -F\" '$4 ~ /play.*zip/{ print $4 }'
Or this is a kind of messy way; search for all zip files:
cat file | tr '"' '\n' | grep -e '\.zip$' | sort -u
That will get all the zip files for you. The tr utility is underused; it just does character replacement, in this case replacing each double quote with a newline, which puts each piece of quoted data on its own line where you can grep it. The dot is escaped so that it matches only a literal ".zip" suffix. The sort -u removes duplicates.
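To make the tr step concrete, here is a small sketch that runs the pipeline on a single line and keeps the intermediate result. The sample line is copied from the question; the variable names are just for illustration:

```shell
# One sample line taken from the question's HTML.
line='<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>'

# Step 1: every double quote becomes a newline, so each quoted value sits on its own line.
pieces=$(printf '%s\n' "$line" | tr '"' '\n')

# Step 2: keep only the lines that end in a literal ".zip", without duplicates.
zips=$(printf '%s\n' "$pieces" | grep '\.zip$' | sort -u)

printf '%s\n' "$pieces"
echo "---"
printf '%s\n' "$zips"
```

The link text after the closing > is not picked up, because that piece ends in </tr> rather than .zip.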
Upvotes: 0
Reputation: 246837
With GNU tools, you can do
grep -oP '(?<=<td class="n"><a href=")[^"]+' | sort -Vr | head -1
Upvotes: 5
Reputation: 86
Contrary to other answers, this can be done entirely with grep.
Your output differs slightly from your input - there are extra elements showing up. For the purposes of this answer I'm going to use this file:
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
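There are a few things you need to do here. First, you need to set the correct grep switches. You need -o (print only the matched part of each line) and -P (interpret the pattern as a Perl-compatible regular expression).
Now you can use the ? modifier to prevent greedy matching: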
grep -o -P '<a href=".*?"' test.html
<a href="play-1.0.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.zip"
<a href="play-1.0.1.zip"
That's not quite right, so we'll anchor the regex to the first match of the line:
grep -o -P '^<tr><td class="n"><a href=".*?"' test.html
<tr><td class="n"><a href="play-1.0.1.zip"
<tr><td class="n"><a href="play-1.0.2.1.zip"
<tr><td class="n"><a href="play-1.0.2.zip"
That's the right data, but with too much cruft. What we need are zero-width assertions (part of the PCRE syntax): bits of regular expression that must match but are not included in the matched text.
grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' test.html
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
Now you can do whatever you need to sort the list. More information on zero width assertions can be found here: http://www.regular-expressions.info/lookaround.html
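For the sorting step, one option is a version sort. This is only a sketch, assuming GNU coreutils (sort -V is not part of POSIX), with the list from the lookaround grep above pasted in as sample data:

```shell
# Sample list, as produced by the lookaround grep above.
versions='play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip'

# sort -V (GNU coreutils) compares the embedded numbers as versions;
# tail -n 1 then keeps the highest one.
latest=$(printf '%s\n' "$versions" | sort -V | tail -n 1)
echo "$latest"
```

In practice you would pipe the grep output straight into sort -V | tail -n 1 instead of using a variable.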
Upvotes: 6
Reputation: 2029
$ grep 'href=' tmp.html | sed 's/.*href="\([^"]*\)".*/\1/'
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
Upvotes: 3
Reputation: 206719
grep doesn't seem like the right tool for this, since you want to extract a submatch.
Here's a perl one-liner that would do it though:
$ perl -ne 'while(/<a href="([^"]+)"/g){print $1, "\n";}' input
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
Upvotes: 1
Reputation: 3181
Try it with the -E switch:
piotrekkr@piotrekkr-desktop:~$ echo '<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>' | grep -E '<a href=".*?"'
<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>
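Note that the command above prints the whole line because -o is not given, so it does not show where the match stops. As far as I know, POSIX extended regexes have no lazy quantifiers, so .*? under -E is not guaranteed to stop at the first quote. A sketch of an alternative that stays within -E, using a negated character class and then cut to strip the href wrapper; the sample line is hypothetical:

```shell
# Hypothetical sample with a second quoted attribute after the link.
line='<a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="m"'

# -o prints only the match; [^"]* cannot run past the closing quote.
match=$(printf '%s\n' "$line" | grep -oE '<a href="[^"]*"')

# Field 2 between the double quotes is the bare file name.
name=$(printf '%s\n' "$match" | cut -d'"' -f2)
echo "$name"
```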
Upvotes: 2