Bill

Reputation: 2349

Regexp to parse HTML imgs

I'm crawling through an HTML page and I want to extract the img srcs and the a hrefs.

On the particular site, all of them are encapsulated in double quotes.

I've tried a wide variety of regexps with no success. Assume characters inside the double-quotes will be [-\w/] (printable characters [a-zA-Z\d-_] and / and .)

In Python:

re.search(r'img\s+src="(?P<src>[\w-/]+_"', line)

Doesn't return anything, but

re.search(r'img\s+src="(?P[-\w[/]]+)"', line)

Returns way too much (i.e., it does not stop at the closing ").

I need help creating the right regexp. Thanks in advance!

Upvotes: 0

Views: 188

Answers (3)

Mark Byers

Reputation: 838416

I need help creating the right regexp.

No, you need help in finding the right tool.

Try BeautifulSoup.

(If you insist on using regular expressions - and I'd advise against it - try changing the greedy + to non-greedy +?).
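To give a rough idea of what the parser approach looks like, here is a minimal sketch. It uses Python's standard-library html.parser as a stand-in for BeautifulSoup, purely so it runs without installing anything; BeautifulSoup offers a friendlier interface for the same job. The class name LinkCollector and the sample markup are made up for illustration and are not from the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects img src and a href values (hypothetical helper for this example)."""

    def __init__(self):
        super().__init__()
        self.srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("src"):
            self.srcs.append(attr_map["src"])
        elif tag == "a" and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


# Sample markup invented for this example
sample = '<p><a href="/gallery/index"><img alt="cat" src="/images/cat-01.png"></a></p>'

collector = LinkCollector()
collector.feed(sample)
collector.close()

print(collector.srcs)
print(collector.hrefs)
```

Because the parser hands over attributes already split into names and values, attribute order and quoting style stop mattering, which is where regexps usually break.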

Upvotes: 6

Daenyth

Reputation: 37441

Here's an example of a better way to do it than with regex, using the excellent lxml library and XPath:


In [1]: import lxml.html
In [2]: doc = lxml.html.parse('http://www.google.com/search?q=kittens&tbm=isch')
In [3]: doc.xpath('//img/@src')
Out[3]: 
['/images/nav_logo_hp2.png',
 'http://t1.gstatic.com/images?q=tbn:ANd9GcQhajNZimPGLw9iTfzrAF_HV5UogY-KGep5WYgw-VHZ15oaAwGquNb5Q2I',
 'http://t2.gstatic.com/images?q=tbn:ANd9GcS1LgVIlDgoIfNzwU4xBz9fL32ZJjZU26aB4aynRsEcz2VuXmjCtvxUonM',
 'http://t1.gstatic.com/images?q=tbn:ANd9GcRgouJt5Moe8uTnDPUFTo4csZOcBtEDA_B7WdRPe8pdZroR5QB2q_-LT59G',
 [...]
]

Upvotes: 5

OlliM

Reputation: 7113

A good trick for finding things inside quotes is the pattern "([^"]+)". It matches any run of characters other than a quote, sitting between two quotes.
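Applied to the question, that trick might look something like the sketch below. It assumes, as the question states, that every attribute value is wrapped in double quotes. The names IMG_SRC and A_HREF and the sample line are invented for illustration.

```python
import re

# Hypothetical pattern names. [^>]*? skips other attributes inside the same tag,
# and ([^"]+) captures everything up to the next double quote.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc="([^"]+)"', re.IGNORECASE)
A_HREF = re.compile(r'<a\b[^>]*?\bhref="([^"]+)"', re.IGNORECASE)

# Sample line invented for this example
line = '<a href="/gallery/index"><img alt="cat" src="/images/cat-01.png"></a>'

srcs = IMG_SRC.findall(line)
hrefs = A_HREF.findall(line)

print(srcs)
print(hrefs)
```

Since the negated class can never cross a quote, the match stops at the closing " without needing a non-greedy quantifier. It will still trip on single-quoted or unquoted attributes, so treat it as a quick fix for this one site rather than a general HTML solution.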

For help with creating regular expressions I can strongly recommend Expresso ( http://www.ultrapico.com/Expresso.htm )

Upvotes: 2
