opensas

Reputation: 63445

How to stop greediness using grep from bash

I have an HTML page with the following content:

[...]
<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>
[...]

And I'd like to extract just

play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

so that I can then find the latest version (in this case, play-1.0.2.1.zip).

So I tried with

cat tmp.html | grep "<a href=\".*\""

<a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="m"
<a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="m"
<a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="m"

Then I tried a lazy quantifier:

cat tmp.html | grep "<a href=\".*?\""

and a negated character class for the quotes:

cat tmp.html | grep "<a href=\"[^\"]*?\""

Both of them return nothing.

I need to get only the matching part (the file name, not the whole href attribute) and then find the latest version, but I'm stuck on this greediness problem.

--

Thanks a lot for all the answers. They were all pretty useful, and it's hard to decide which one is correct. In the end I solved it with:

grep -v '.*-RC.*' index.html | grep -oP 'play-1.*?.zip' | sort -Vru | head -1

Upvotes: 3

Views: 555

Answers (9)

jokmi

Reputation: 31

I didn't see cut among the answers (and I like it for its brevity and speed), so:

cut -d\" -f4 tmp.html | sort -Vu | tail -1

output:

play-1.0.2.1.zip
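
To see why field 4 is the file name, it can help to print every field that results from splitting one row on the double quote. This is only a sketch: the `row` variable holds one line copied from the question, and awk is used here purely to number the fields.

```shell
# One row copied from the question's HTML.
row='<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>'

# Number every field produced by splitting the row on the double quote.
printf '%s\n' "$row" | awk -F'"' '{ for (i = 1; i <= NF; i++) print i ": " $i }'

# cut uses the same numbering, so -f4 selects the href value.
field4=$(printf '%s\n' "$row" | cut -d'"' -f4)
echo "$field4"   # prints play-1.0.1.zip
```

On a full page, rows with a different layout would put something else in field 4, so the result is worth checking before sorting it.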

Upvotes: 3

E1Suave

Reputation: 278

This uses the answer provided by Craig Andrews, with the addition of OS X support:

grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' /test.html | sort -n -r -k1.10,12

Result:

play-1.0.2.1.zip
play-1.0.2.zip
play-1.0.1.zip

Upvotes: 1

Toto

Reputation: 91430

A Perl way:

cat thefile | perl -anF'"' -e 'print $F[3],"\n";($v)=$F[3]=~/(\d.*\d)/;$m=$v if$v gt $m;}{print "max=$m\n";'

output:

play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip
max=1.0.2.1

Upvotes: 0

ptau

Reputation: 113

Awk is a great tool if you know the field numbers:

awk -F\" '$4 ~ /play.*zip/{ print $4 }'

Or, in a somewhat messier way, search for all zip files:

cat file | tr '"' '\n' | grep -e '\.zip$' | sort -u

That gets all the zip files for you. The tr utility is underused; it simply replaces characters, in this case turning each double quote into a newline, which puts every quoted string on its own line where grep can find it. The sort -u removes duplicates.
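
A minimal, self-contained sketch of the tr approach, assuming the three rows from the question as input (the `sample` and `zips` variable names are only for illustration):

```shell
# Sample input: the three rows shown in the question.
sample='<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td></tr>'

# Put every quoted string on its own line, keep only the lines that end
# in ".zip", and drop duplicates.
zips=$(printf '%s\n' "$sample" | tr '"' '\n' | grep '\.zip$' | sort -u)
printf '%s\n' "$zips"
```

The link text after each `>` also contains a file name, but those lines end in `</tr>` rather than `.zip`, so the `$` anchor filters them out.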

Upvotes: 0

glenn jackman

Reputation: 246837

With GNU tools, you can do:

grep -oP '(?<=<td class="n"><a href=")[^"]+' | sort -Vr | head -1
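
The -V in the sort step is what makes "latest" mean the highest version rather than the last name alphabetically. A small sketch of the difference, assuming a sort that supports -V (a GNU extension); the file names below are hypothetical, not from the question, and were picked so that the two orderings disagree:

```shell
# Hypothetical file names chosen so that plain and version ordering differ.
names='play-1.0.9.zip
play-1.0.10.zip
play-1.0.2.zip'

# Plain sort compares character by character, so "1.0.9" beats "1.0.10".
lexical_latest=$(printf '%s\n' "$names" | LC_ALL=C sort -r | head -n 1)

# Version sort compares runs of digits as numbers, so 10 beats 9.
version_latest=$(printf '%s\n' "$names" | sort -Vr | head -n 1)

echo "lexical: $lexical_latest"   # prints lexical: play-1.0.9.zip
echo "version: $version_latest"   # prints version: play-1.0.10.zip
```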

Upvotes: 5

Craig Andrews

Reputation: 86

Contrary to other answers, this can be done entirely with grep.

Your output differs slightly from your input: extra elements show up. For the purposes of this answer I'm going to use this file:

<tr><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.1.zip">play-1.0.2.1.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>
<tr><td class="n"><a href="play-1.0.2.zip">play-1.0.2.zip</a></td><td class="n"><a href="play-1.0.1.zip">play-1.0.1.zip</a></td></tr>

There are a few things you need to do here. First, you need to set the correct grep switches. You need:

  • -o to only output the matched portion of each line
  • -P to use the Perl compatible regex engine

Now you can use the ? modifier to prevent greedy matching:

grep -o -P '<a href=".*?"' test.html

<a href="play-1.0.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.1.zip"
<a href="play-1.0.1.zip"
<a href="play-1.0.2.zip"
<a href="play-1.0.1.zip"

That's not quite right, so we'll anchor the regex to the start of the line so that only the first link on each line matches:

grep -o -P '^<tr><td class="n"><a href=".*?"' test.html

<tr><td class="n"><a href="play-1.0.1.zip"
<tr><td class="n"><a href="play-1.0.2.1.zip"
<tr><td class="n"><a href="play-1.0.2.zip"

That's the right data, but with too much cruft. What we need are zero-width assertions (part of the PCRE syntax): bits of regular expression that must match but do not count toward the matched text.

grep -o -P '(?<=^<tr><td class="n"><a href=").*?(?=")' test.html

play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

Now you can do whatever you need to sort the list. More information on zero width assertions can be found here: http://www.regular-expressions.info/lookaround.html
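
As a side note on why -P matters: without it, grep uses basic regular expressions, in which ? is an ordinary character rather than a lazy modifier. That would explain why the `.*?` attempts in the question printed nothing. A small sketch of that behaviour (the second input line is hypothetical, made up only to contain a literal `?"`):

```shell
# One link copied from the question.
line='<a href="play-1.0.1.zip">play-1.0.1.zip</a>'

# Without -P, "?" is a literal question mark, so this pattern requires
# the text ?" to appear somewhere after <a href=" and the line fails.
bre_count=$(printf '%s\n' "$line" | grep -c '<a href=".*?"' || true)

# A hypothetical line that really contains ?" matches the same pattern.
literal_count=$(printf '%s\n' '<a href="play.zip?">' | grep -c '<a href=".*?"' || true)

echo "question line: $bre_count, line with a literal ?: $literal_count"
# prints: question line: 0, line with a literal ?: 1
```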

Upvotes: 6

strkol

Reputation: 2029

$ grep 'href=' tmp.html | sed 's/.*href="\(.*\)".*/\1/'
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

Upvotes: 3

Mat

Reputation: 206719

grep doesn't seem like the right tool for this, since you want to extract a submatch.

Here's a Perl one-liner that would do it, though:

$ perl -ne 'while(/<a href="([^"]+)"/g){print $1, "\n";}' input 
play-1.0.1.zip
play-1.0.2.1.zip
play-1.0.2.zip

Upvotes: 1

piotrekkr

Reputation: 3181

Try it with the -E switch:

piotrekkr@piotrekkr-desktop:~$ echo '<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>' | grep -E '<a href=".*?"'
<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>
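
The command above prints the whole line because grep prints every line that contains a match. To get only the link, one option is to combine -E with -o and a negated character class, which cannot run past the closing quote. This is a sketch, assuming a grep that supports -o (GNU and BSD grep both do); the `line`, `match` and `name` variables are only for illustration:

```shell
# The same input line as in the command above.
line='<a href="play-1.0.1.zip">play-1.0.1.zip</a></td>'

# -o prints only the matched text; [^"]* stops at the first closing quote.
match=$(printf '%s\n' "$line" | grep -oE '<a href="[^"]*"')
echo "$match"   # prints <a href="play-1.0.1.zip"

# The file name is the second quote-delimited field of that match.
name=$(printf '%s\n' "$match" | cut -d'"' -f2)
echo "$name"    # prints play-1.0.1.zip
```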

Upvotes: 2
