Reputation: 7792
I'm trying to extract two strings from this string using regular expressions:
'<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'
I want the URL after src and the text after alt (that is, the URL and Organic Chemistry I (as Second Language)).
For the URL I've tried ('<img src=(\w+)" width'), ('<img src="(\w+)"') and ('src="(\w+)"\swidth'), and all of them return empty.
I've also tried ('alt="(\w+)"') for the name and again, no luck.
Can anyone help?
Upvotes: 1
Views: 1691
Reputation: 50497
Use lxml.
import lxml.html
html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'
# Parse the fragment into an element, then read its attributes by name
img = lxml.html.fromstring(html_string)
print("src:", img.get("src"))
print("alt:", img.get("alt"))
Gives:
src: http://images.efollett.com/books/978/047/012/9780470129296.gif
alt: Organic Chemistry I (as Second Language)
Upvotes: 3
Reputation: 16115
I don't know Python, but maybe this regular expression helps?
<img.*?src="([^"]*)".*?alt="([^"]*)".*?>
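For reference, here is a minimal sketch of how that pattern could be used from Python with the standard re module. The variable names are only illustrative, and it assumes src always appears before alt inside the tag, since the pattern depends on that order.

```python
import re

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# One pattern, two capture groups: group 1 is src, group 2 is alt.
# Assumption: src comes before alt in the tag; the pattern will not
# match if the attributes are in the opposite order.
pattern = re.compile(r'<img.*?src="([^"]*)".*?alt="([^"]*)".*?>')

match = pattern.search(html_string)
if match:
    src, alt = match.groups()
    print("src:", src)
    print("alt:", alt)
```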
Upvotes: 0
Reputation: 2028
You can try r'<img[^>]*\ssrc="(.*?)"' and r'<img[^>]*\salt="(.*?)"'.
I don't know if you are dealing with HTML. [^>]* keeps the match inside the tag's angle brackets. \s requires whitespace before the attribute name, which avoids matching attributes like "xxxsrc" and also takes care of newlines between attributes.
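A rough sketch of how these two patterns might be applied in Python follows. The helper first_group is a hypothetical name introduced only for this example, not part of the re module.

```python
import re

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# One pattern per attribute, so attribute order does not matter.
src_pattern = re.compile(r'<img[^>]*\ssrc="(.*?)"')
alt_pattern = re.compile(r'<img[^>]*\salt="(.*?)"')


def first_group(pattern, text):
    """Return the first capture group of the first match, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


src = first_group(src_pattern, html_string)
alt = first_group(alt_pattern, html_string)
print("src:", src)
print("alt:", alt)
```

Because of the leading \s, a tag such as <img data-src="lazy.gif" src="real.gif"> yields real.gif for the src pattern rather than the data-src value.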
Upvotes: 1
Reputation: 88378
Although you should not be parsing HTML with regexes, I can point out a common regex error here, which is your use of \w. That only matches letters, digits, and underscores. Not slashes, colons, dots, spaces, or parentheses, all of which appear in your attribute values. If you are trying to pull data out of attributes, use "([^"]*)" or "(.*?)".
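To make the difference concrete, here is a small sketch comparing the \w+ attempts from the question with the two suggested alternatives. The variable names are illustrative only.

```python
import re

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# \w+ stops at the first ':' in the URL and at the first space in the
# alt text, so the closing quote is never reached and nothing matches.
failed_src = re.findall(r'src="(\w+)"', html_string)
failed_alt = re.findall(r'alt="(\w+)"', html_string)
print(failed_src, failed_alt)

# "Anything except a quote" and "as little as possible up to the next
# quote" both capture the full attribute value.
src_values = re.findall(r'src="([^"]*)"', html_string)
alt_values = re.findall(r'alt="(.*?)"', html_string)
print(src_values)
print(alt_values)
```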
Upvotes: 2