Reputation: 4854

Python Regex to extract content of src of an html tag?

I tried something like this, but it failed. I don't know regex; can anyone help me with this?

import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""
x = re.search('^src=".*jpg$', html)
print(x)

I'm expecting output like this: ['/pic/earth.jpg', '/pic/redrose.jpg']

Upvotes: 0

Answers (2)

James McGuigan

Reputation: 8086

Good start, but you have several minor issues with your code:

^ and $ match the start and end of the whole string
- or the start and end of each line when the re.MULTILINE flag is enabled
.search() returns None or a single Match object rather than the matched strings
you probably want the .findall() method
if you have backslashes in your regex (which you don't yet), you may want to use raw r"string" literals for your regex code
also think about all the possible variations in your input data: HTML allows both ' and " for quotes, and there could be a src= attribute on something that is not an image
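
To make the first few points concrete, here is a minimal sketch that reuses the html string from the question. The variable names (anchored, first, all_hits) and the simplified unanchored pattern are only illustrative assumptions, not part of the original code:

```python
import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# Original pattern: ^ anchors the match at the very start of the string,
# where there is no src=, so search() finds nothing.
anchored = re.search('^src=".*jpg$', html)
print(anchored)        # None

# Without the anchors, search() returns a Match object for the first hit only.
first = re.search(r'src="[^"]*jpg', html)
print(first.group())   # src="/pic/earth.jpg

# findall() returns every non-overlapping match as a list of strings.
all_hits = re.findall(r'src="[^"]*jpg', html)
print(all_hits)        # ['src="/pic/earth.jpg', 'src="/pic/redrose.jpg']
```

Note that these matches still include the src=" prefix; a capture group is what trims the result down to just the URL.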

Here are the docs: https://docs.python.org/3/library/re.html#re.findall

Try this as a regex:

image_urls = re.findall(r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', html, re.I)
print(image_urls)
# prints ['/pic/earth.jpg', '/pic/redrose.jpg']

To break this down a little:

re.findall() returns a list of strings
<img the match has to start at an image tag
[^<>]+ 1 or more chars that don't open/close the html tag
- this stops the match from running into another tag when the current <img> has no src="" attribute
["\'] the HTML could use either type of quote
[^"\'<>]+ keep reading 1+ chars whilst the string and the tag are not closed
\. literal dots need to be escaped, else they mean the "match any character" special char
(?:gif|png|jpe?g) a set of possible file extensions, in a non-capturing group (a capturing group would return these in your list as well)
([^"\'<>]+\.(?:gif|png|jpe?g)) this is the capture group for what will actually get returned for each match
["\'] match the closing quote that follows the capture group
re.I make the regex case insensitive
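
As a rough check of the breakdown above, the sketch below runs the same pattern against a made-up sample. The sample markup and the names IMG_SRC, sample and urls are assumptions for illustration only:

```python
import re

# Same pattern as above, compiled once so it can be reused.
IMG_SRC = re.compile(
    r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']',
    re.I,
)

# Hypothetical input covering the edge cases mentioned in the answer.
sample = """
<script src="/js/app.js"></script>
<IMG SRC='/pic/Moon.PNG' alt="moon">
<img alt="sun" src="/pic/sun.jpeg">
<img alt="no source here">
"""

urls = IMG_SRC.findall(sample)
print(urls)  # ['/pic/Moon.PNG', '/pic/sun.jpeg']
```

The script tag is skipped because the match must start at <img, the upper-case tag and single quotes are handled by re.I and ["\'], and the last tag is skipped because it has no src attribute. One limitation worth knowing: the pattern does not check that the opening and closing quotes are the same character.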

Upvotes: 3

Shinbeom Choi

Reputation: 48

I'm not good at regex, so my answer may not be the best.

Try this.

x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

Then x will look like this:

['/pic/earth.jpg', '/pic/redrose.jpg']

RegEx explanation :

(?=src) : positive lookahead --> only matches at positions where the text src comes next

src=\" : the literal text src=" must appear here

(?P<src>something) : named group; whatever something matches is captured under the name src

[^\"]+ : one or more characters other than "
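
A small sketch of how the pieces above behave, again using the html string from the question. The variable names are illustrative assumptions. It suggests that the lookahead is redundant here, since the literal src that follows it already enforces the same condition, and that the group name becomes useful with re.finditer():

```python
import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# Pattern from this answer, with the lookahead.
with_lookahead = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

# Same pattern without the lookahead and without the unneeded backslashes.
without_lookahead = re.findall(r'src="(?P<src>[^"]+)', html)

# The group name lets you pull the value out of each Match object by name.
named = [m.group("src") for m in re.finditer(r'src="(?P<src>[^"]+)', html)]

print(with_lookahead)     # ['/pic/earth.jpg', '/pic/redrose.jpg']
print(without_lookahead)  # ['/pic/earth.jpg', '/pic/redrose.jpg']
print(named)              # ['/pic/earth.jpg', '/pic/redrose.jpg']
```

Unlike the other answer, this pattern matches any src="..." attribute, not only those inside <img> tags, and only handles double quotes.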

Upvotes: 2
