Reputation: 7792

Python Regex String Extraction

I'm trying to extract two strings from this string using regular expressions:

'<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

I want the URL after src and the text after alt (so "Organic Chemistry I (as Second Language)" and the URL).

I've tried ('<img src=(\w+)" width'), ('<img src="(\w+)"') and ('src="(\w+)"\swidth') for the URL, and all return empty.

I've also tried ('alt="(\w+)"') for the name and again, no luck.

Can anyone help?

Upvotes: 1

Answers (4)

Acorn

Reputation: 50587

Use lxml.

import lxml.html

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

img = lxml.html.fromstring(html_string)

print("src:", img.get("src"))
print("alt:", img.get("alt"))

Gives:

src: http://images.efollett.com/books/978/047/012/9780470129296.gif
alt: Organic Chemistry I (as Second Language)

Upvotes: 3

scessor

Reputation: 16125

I don't know Python, but maybe this regular expression helps?

<img.*?src="([^"]*)".*?alt="([^"]*)".*?>
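In Python, the pattern above could be applied roughly as follows. This is only a sketch: the variable names are illustrative, and it assumes src appears before alt inside the tag, as it does in the question's string.

```python
import re

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# One pattern, two capture groups: group 1 is src, group 2 is alt.
# Assumption: src comes before alt in the tag.
pattern = re.compile(r'<img.*?src="([^"]*)".*?alt="([^"]*)".*?>')

match = pattern.search(html_string)
if match:
    src, alt = match.groups()
    print("src:", src)
    print("alt:", alt)
else:
    src = alt = None
    print("no match")
```

If the attributes can appear in either order, two separate searches (one per attribute) would be more robust than a single combined pattern.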

Upvotes: 0

eph

Reputation: 2028

You can try r'<img[^>]*\ssrc="(.*?)"' and r'<img[^>]*\salt="(.*?)"'.

I don't know if you are dealing with HTML. [^>]* keeps the match inside the tag's angle brackets. \s requires whitespace before the attribute name, which avoids matching attributes like "xxxsrc" and also handles newlines between attributes.
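A minimal sketch of how these two patterns might be used with re.search, run against the string from the question (the variable names are just for illustration):

```python
import re

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# Each attribute gets its own search, so their order in the tag does not matter.
src_match = re.search(r'<img[^>]*\ssrc="(.*?)"', html_string)
alt_match = re.search(r'<img[^>]*\salt="(.*?)"', html_string)

# re.search returns None when nothing matches, so guard before calling group().
src = src_match.group(1) if src_match else None
alt = alt_match.group(1) if alt_match else None

print("src:", src)
print("alt:", alt)
```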

Upvotes: 1

Ray Toal

Reputation: 88478

Although you should not be parsing HTML with regexes, I can point out a common regex error here, which is your use of \w. That only matches letters, digits, and underscores. Not slashes, not colons, not spaces, not parentheses. If you are trying to pull data out of attributes, use "([^"]*)" or "(.*?)".
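To see the difference, here is a small sketch comparing \w+ with [^"]* on the string from the question (the variable names are only illustrative):

```python
import re

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# \w+ stops at the ':' in the URL and at the first space in the alt text,
# so the closing quote is never reached and both searches fail.
bad_src = re.search(r'src="(\w+)"', html_string)
bad_alt = re.search(r'alt="(\w+)"', html_string)
print(bad_src, bad_alt)

# [^"]* accepts any character except a double quote, so it spans the whole value.
good_src = re.search(r'src="([^"]*)"', html_string).group(1)
good_alt = re.search(r'alt="([^"]*)"', html_string).group(1)
print(good_src)
print(good_alt)
```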

Upvotes: 2
