Reputation: 2349
I'm crawling through an HTML page and I want to extract the img srcs and the a hrefs.
On this particular site, all of them are enclosed in double quotes.
I've tried a wide variety of regexps with no success. Assume the characters inside the double quotes will be [-\w/.] (that is, letters, digits, -, _, / and .)
In python:
re.search(r'img\s+src="(?P<src>[\w-/]+_"', line)
Doesn't return anything, but
re.search(r'img\s+src="(?P<src>[-\w[/]]+)"', line)
Returns way too much (i.e., it does not stop at the ").
I need help creating the right regexp. Thanks in advance!
Upvotes: 0
Views: 188
Reputation: 838416
I need help creating the right regexp.
No, you need help in finding the right tool.
Try BeautifulSoup.
(If you insist on using regular expressions - and I'd advise against it - try changing the greedy + to the non-greedy +?.)
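To illustrate the greedy versus non-greedy difference mentioned above, here is a minimal sketch using only the standard re module. The sample line is made up, and the pattern uses . rather than the asker's character class purely to make the effect visible; treat it as a demonstration, not a recommended scraper.

```python
import re

# Hypothetical sample line; the markup is made up for illustration.
line = '<img src="/images/a.png" alt="first"><img src="/images/b.png" alt="second">'

# Greedy: .+ grabs as much as possible, so the match only ends at the
# LAST double quote on the line.
greedy = re.search(r'img\s+src="(?P<src>.+)"', line)

# Non-greedy: .+? grabs as little as possible, so the match ends at the
# FIRST double quote after the opening one.
lazy = re.search(r'img\s+src="(?P<src>.+?)"', line)

greedy_src = greedy.group('src')
lazy_src = lazy.group('src')

print(greedy_src)  # runs on past the first closing quote
print(lazy_src)    # /images/a.png
```

The non-greedy form still breaks on markup it does not anticipate (single quotes, attributes in a different order, line breaks inside a tag), which is the reason for preferring a parser.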
Upvotes: 6
Reputation: 37441
Here's an example of a better way to do it than with regex, using the excellent lxml library and XPath:
In [1]: import lxml.html
In [2]: doc = lxml.html.parse('http://www.google.com/search?q=kittens&tbm=isch')
In [3]: doc.xpath('//img/@src')
Out[3]:
['/images/nav_logo_hp2.png',
'http://t1.gstatic.com/images?q=tbn:ANd9GcQhajNZimPGLw9iTfzrAF_HV5UogY-KGep5WYgw-VHZ15oaAwGquNb5Q2I',
'http://t2.gstatic.com/images?q=tbn:ANd9GcS1LgVIlDgoIfNzwU4xBz9fL32ZJjZU26aB4aynRsEcz2VuXmjCtvxUonM',
'http://t1.gstatic.com/images?q=tbn:ANd9GcRgouJt5Moe8uTnDPUFTo4csZOcBtEDA_B7WdRPe8pdZroR5QB2q_-LT59G',
[...]
]
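If installing lxml is not an option, the same parser-based idea can be sketched with the standard library's html.parser. This is a stand-in, not lxml or XPath, and it also collects the a hrefs the question asks for. The class name LinkCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects img src and a href values (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; tag and attribute
        # names are already lowercased by the parser.
        attr_map = dict(attrs)
        if tag == 'img' and attr_map.get('src') is not None:
            self.srcs.append(attr_map['src'])
        elif tag == 'a' and attr_map.get('href') is not None:
            self.hrefs.append(attr_map['href'])


# Made-up markup for illustration only.
sample = (
    '<a href="/kittens/1"><img src="/images/kitten-1.png"></a>'
    '<a href="/about">About</a>'
)

collector = LinkCollector()
collector.feed(sample)
collector.close()

print(collector.srcs)   # ['/images/kitten-1.png']
print(collector.hrefs)  # ['/kittens/1', '/about']
```

Because the parser handles the tag and attribute syntax itself, the quoting style and attribute order on the page stop mattering.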
Upvotes: 5
Reputation: 7113
A good trick for finding things inside quotes is the pattern "([^"]+)". It matches any run of characters other than the quote itself, sitting between two quotes, so the match cannot run past the closing quote.
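Applied to the question, that trick might look like the sketch below. It assumes, as the asker's own patterns do, that src or href comes directly after the tag name and that the value is always in double quotes; the sample line is made up.

```python
import re

# Made-up line for illustration.
line = '<a href="/gallery/cats"><img src="/images/cat_01.png" alt="a cat"></a>'

# "([^"]+)" captures one or more non-quote characters between two quotes,
# so the capture always ends at the first closing quote.
img_src = re.compile(r'<img\s+src="([^"]+)"')
a_href = re.compile(r'<a\s+href="([^"]+)"')

srcs = img_src.findall(line)
hrefs = a_href.findall(line)

print(srcs)   # ['/images/cat_01.png']
print(hrefs)  # ['/gallery/cats']
```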
For help with creating regular expressions, I can strongly recommend Expresso (http://www.ultrapico.com/Expresso.htm).
Upvotes: 2