Reputation: 4854
I tried something like this, but it failed. I don't know regex. Can anyone help me with this?
import re
html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""
x = re.search('^src=".*jpg$', html)
print(x)
I'm expecting output like this: ['/pic/earth.jpg', '/pic/redrose.jpg']
Upvotes: 0
Views: 1875
Reputation: 8086
Good start, but your code has a few issues:
- ^ and $ refer to the start and end of the whole string. Your src= attributes sit in the middle of it, so the pattern never matches.
- .search() returns None or a single Match object rather than the matched strings. Use the .findall() method to get all of them.
- Use r"string" raw strings for your regex code.
- Watch out for both ' and " as quotes, and remember that there could be a src= attribute on something that is not an image.
Here are the docs: https://docs.python.org/3/library/re.html#re.findall
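To make the first two points concrete, here is a small sketch using the HTML from the question. The unanchored pattern src="([^"]+)" is only an illustration for this comparison, not the final regex suggested in this answer.

```python
import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# Pattern from the question: ^ anchors to the start of the whole string,
# which begins with a newline rather than 'src=', so nothing matches.
anchored = re.search('^src=".*jpg$', html)
print(anchored)  # None

# Without the anchors, search() returns a Match object for the first hit only.
first = re.search(r'src="([^"]+)"', html)
print(first.group(1))  # /pic/earth.jpg

# findall() returns the captured group of every match as a list of strings.
all_srcs = re.findall(r'src="([^"]+)"', html)
print(all_srcs)  # ['/pic/earth.jpg', '/pic/redrose.jpg']
```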
Try this as a regex:
image_urls = re.findall(r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', html, re.I)
print(image_urls)
>>> ['/pic/earth.jpg', '/pic/redrose.jpg']
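As a rough check of how the pattern behaves on messier markup, the sketch below runs it against some made-up HTML. The messy_html string and the IMG_SRC name are assumptions for illustration and are not from the question.

```python
import re

# Same pattern as above, compiled once so it can be reused.
IMG_SRC = re.compile(
    r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', re.I
)

# Hypothetical markup: a non-image src, upper-case tag and attribute names,
# single quotes, and an extension that is not in the allowed list.
messy_html = """
<script src="/js/app.js"></script>
<IMG alt="planet" SRC='/pic/earth.JPG'>
<img src="/pic/redrose.jpeg" width="200">
<img src="/pic/logo.svg">
"""

found = IMG_SRC.findall(messy_html)
print(found)  # ['/pic/earth.JPG', '/pic/redrose.jpeg']
```

The script tag is skipped because the match has to start at <img, and the .svg file is skipped because its extension is not in the alternation.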
To break this down a little:
re.findall() : returns a list of strings
<img : we are looking to start in an image tag
[^<>]+ : 1 or more chars that don't open/close the HTML tag
src= : the src attribute in the current <img> tag
["\'] : the HTML could use either type of quote
[^"\'<>]+ : keep reading 1+ chars while the string and the tag are not closed
\. : literal dots need to be escaped, else they mean the "match anything" special char
(?:gif|png|jpe?g) : a range of possible file extensions, but don't create a capture group for them (which would return these in your list)
([^"\'<>]+\.(?:gif|png|jpe?g)) : this is the capture group for what will actually get returned for each match
["\'] : match the closing quote that follows the capture group
re.I : make the regex case insensitive
Upvotes: 3
Reputation: 48
I'm not good at regex, so my answer may not be the best.
Try this.
x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)
Then x will look like this:
['/pic/earth.jpg', '/pic/redrose.jpg']
RegEx explanation :
(?=src) : positive lookahead --> only continue at positions where the text src comes next
src=\" : must include this specific text src="
(?P<src>something) : named group; captures something under the name src
[^\"]+ : one or more characters other than "
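A small sketch of the points above, using a shortened version of the HTML from the question. It suggests the lookahead is redundant here, since the literal src=" that follows already requires the same text, and shows one way to read the named group through re.finditer(). The variable names are assumptions for illustration.

```python
import re

html = """
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
"""

# With and without the (?=src) lookahead: the results are identical,
# because the literal 'src="' already demands the same text.
with_lookahead = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)
without_lookahead = re.findall(r'src="(?P<src>[^"]+)', html)
print(with_lookahead == without_lookahead)  # True

# The named group can be read by name from each Match object.
named = [m.group("src") for m in re.finditer(r'src="(?P<src>[^"]+)', html)]
print(named)  # ['/pic/earth.jpg', '/pic/redrose.jpg']
```

Note that this pattern matches any src= attribute, not only the ones on <img> tags.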
Upvotes: 2