Reputation: 4196

One more greedy sed question

I'm doing an automated download of a number of images using an HTML frame source. So far, so good: sed and wget. Example of the frame source:

<td width="25%" align="center" valign="top"><a href="images/display.htm?concept_Core.jpg"><img border="1" src="t_core.gif" width="120" height="90"><font size="1" face="Verdana"><br>Hyperspace Core<br>(Rob Cunningham)</font></a></td>

So I do this:

sed -n -e 's/^.*htm?\(.*jpg\).*$/\1/p' concept.htm

to get the part which looks like this:

concept_Core.jpg

and then do this:

wget --base=/some/url/concept_Core.jpg

But there is one nasty line. That line is obviously a bug in the site, or whatever it may be, but I can't change it. ;)

<td width="25%" bla bla face="Verdana"><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img bla bla href="images/concept_frigate16.jpg" target="_top"><br>Frigate 16<br>

That is, two of these "concept_frigate16.jpg" on one line. And my script gives me:

concept_frigate16.jpg" target="_top"><img border="1" src="t_assaultfrigate.gif" width="120" height="90" alt="The '16' in the name may be a Sierra typo."></a><a href="images/concept_frigate16.jpg

You understand why. Sed is greedy and this obviously shows up in this case.

Now the question is, how do I get rid of this corner case? That is, make it non-greedy and make it stop at the FIRST .jpg?

Upvotes: 0

Answers (5)

Dennis Williamson

Reputation: 360693

GNU grep can do PCRE:

grep -Po '(?<=\.htm\?).*?jpg' concept.htm
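
As a quick check, here is a sketch that runs this command on a shortened copy of the problem line from the question. It assumes a GNU grep built with PCRE support (the -P option); the `nasty` and `first_jpg` variables and the trimmed line are just for illustration.

```shell
#!/bin/sh
# Shortened copy of the problem line from the question (attributes trimmed).
nasty='<td><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img src="t.gif"></a><a href="images/concept_frigate16.jpg" target="_top">Frigate 16'

# (?<=\.htm\?) requires ".htm?" just before the match without including it;
# .*?jpg is lazy, so it stops at the first "jpg" instead of the last one.
first_jpg=$(printf '%s\n' "$nasty" | grep -Po '(?<=\.htm\?).*?jpg')

printf '%s\n' "$first_jpg"
```

Only the file name from the first href should come out, because the second href has no ".htm?" in front of it.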

Upvotes: 0

khachik

Reputation: 28703

sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'

Upvotes: 1

kovarex

Reputation: 1830

Use [^"] instead of . in the regular expression. This matches any character except a double quote, so the match cannot run past the closing quote of the first href.

Upvotes: 1

paxdiablo

Reputation: 882726

You might want to consider changing:

\(.*jpg\)

into:

\([^"]*jpg\)

This should stop your initial search going beyond the end of the first href. Whether that will introduce other problems (for other edge cases) is a little difficult to say given I don't know the full set of inputs.
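
A minimal sketch of that change, run against shortened copies of the two lines from the question. The `extract_jpg` function name, the variable names, and the trimmed sample lines are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print the image name that follows "htm?" on each line.
# [^"]* cannot cross a double quote, so the match ends inside the first href.
extract_jpg() {
    sed -n -e 's/^.*htm?\([^"]*jpg\).*$/\1/p'
}

# Shortened sample lines based on the question (attributes trimmed).
normal='<td><a href="images/display.htm?concept_Core.jpg"><img src="t_core.gif"></a></td>'
nasty='<td><a href="images/display.htm?concept_frigate16.jpg" target="_top"><img src="t.gif"></a><a href="images/concept_frigate16.jpg" target="_top">Frigate 16'

result_normal=$(printf '%s\n' "$normal" | extract_jpg)
result_nasty=$(printf '%s\n' "$nasty" | extract_jpg)

printf '%s\n' "$result_normal" "$result_nasty"
```

Both lines should yield just the file name, since the capture can no longer extend past the quote that closes the first href.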

If it does, you may want to opt for using a real parser rather than regexes. Regexes are a powerful tool but they're not necessarily suited for everything.

Upvotes: 1

ennuikiller

Reputation: 46985

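Use perl, which supports the non-greedy quantifier .*? (note that in Perl the grouping parentheses are not escaped, while the literal ? must be):

perl -ne 'print if s/^.*htm\?(.*?jpg).*$/$1/' concept.htm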

Upvotes: 2
