Swaroop Maddu

Reputation: 4854

Python Regex to extract content of src of an html tag?

I tried something like this, but it failed. I don't know regex; can anyone help me with this?

import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""
x = re.search('^src=".*jpg$', html)
print(x)

I'm expecting output like this: ['/pic/earth.jpg', '/pic/redrose.jpg']

Upvotes: 0

Views: 1875

Answers (2)

James McGuigan

Reputation: 8086

Good start, but your code has several minor issues:

  • ^ and $ match the start and end of the whole string
    • or the start and end of each line with the re.MULTILINE flag enabled
  • .search() returns None or a single Match object rather than the matched strings
  • you probably want the .findall() method
  • if you have backslashes in your regex (which you don't yet), then you may want to use raw r"string" strings for your regex code
  • also think of all the possible variations in your input data, such as HTML allowing both ' and " for quotes, and that there could be a src= attribute in something that is not an image
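
To illustrate the first few points, here is a minimal sketch using the html string from the question. The unanchored pattern src="(.*?jpg)" is only an assumption made for this demonstration, not the final regex:

```python
import re

html = """
<body>
<h1>dummy heading</h1>
<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">
</body>
"""

# The pattern from the question: ^ can only match at the very start of the
# string (which is a newline here), so nothing is found.
anchored = re.search('^src=".*jpg$', html)

# Without the anchors, search() stops at the first match and returns a Match object.
first = re.search(r'src="(.*?jpg)"', html)

# findall() collects every match; with one capture group it returns the group text.
all_srcs = re.findall(r'src="(.*?jpg)"', html)

print(anchored)        # None
print(first.group(1))  # /pic/earth.jpg
print(all_srcs)        # ['/pic/earth.jpg', '/pic/redrose.jpg']
```

Even with re.MULTILINE the original pattern would still find nothing here, because every line containing src= starts with <img rather than with src.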

Here are the docs: https://docs.python.org/3/library/re.html#re.findall

Try this as a regex:

image_urls = re.findall(r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', html, re.I)
print(image_urls)
# prints: ['/pic/earth.jpg', '/pic/redrose.jpg']

To break this down a little:

  • re.findall() returns a list of strings
  • <img the match must start at the opening of an image tag
  • [^<>]+ 1 or more chars that don't open/close the html tag
    • this keeps the match inside the current <img>, which might not have a src="" attribute at all
  • ["\'] the HTML could use either type of quote
  • [^"\'<>]+ keep reading 1+ chars whilst the string and the tag are not closed
  • \. literal dots need to be escaped, else they mean the "match anything" special char
  • (?:gif|png|jpe?g) a set of possible file extensions, in a non-capturing group (a second capture group here would make findall() return tuples instead of plain strings)
  • ([^"\'<>]+\.(?:gif|png|jpe?g)) this is the capture group for what will actually get returned for each match
  • ["\'] the closing quote that must follow the capture group
  • re.I makes the regex case insensitive
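
As a rough check of the points in this breakdown, the same regex can be run against messier input. The sample markup and the names IMG_SRC and sample below are made up for illustration:

```python
import re

# Same pattern as above, precompiled so it can be reused.
IMG_SRC = re.compile(
    r'<img[^<>]+src=["\']([^"\'<>]+\.(?:gif|png|jpe?g))["\']', re.I
)

# Hypothetical input mixing quote styles, letter case, and non-image tags.
sample = """
<script src="/js/app.js"></script>
<img alt="planet" src='/pic/earth.JPG'>
<img class="hero" src="/pic/redrose.jpeg" width="200">
<img alt="no source here">
<iframe src="/embed/video.png"></iframe>
"""

urls = IMG_SRC.findall(sample)
print(urls)  # ['/pic/earth.JPG', '/pic/redrose.jpeg']
```

The script and iframe tags are skipped because the match has to begin with <img, and the image without a src attribute is skipped because [^<>]+ cannot run past its closing >.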

Upvotes: 3

Shinbeom Choi

Reputation: 48

I'm not good at regex, so my answer may not be the best.

Try this.

x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

Then x looks like this:

['/pic/earth.jpg', '/pic/redrose.jpg']

Regex explanation:

(?=src) : positive lookahead --> only matches at positions where the word src follows

src=\" : the literal text src=" must appear

(?P<src>something) : a named group; whatever something matches is captured under the name src

[^\"]+ : one or more characters other than "
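
A small sketch of how these pieces behave, using just the two img lines from the question. It suggests that the lookahead does not change the result here (the literal src= that follows already requires the same text) and shows one way to read the named group through finditer(). The variable names are only illustrative:

```python
import re

html = '''<img src="/pic/earth.jpg" alt="planet" width="200">
<img src="/pic/redrose.jpg" alt="flower" width="200">'''

# With the lookahead, as written in this answer.
with_lookahead = re.findall(r'(?=src)src="(?P<src>[^"]+)', html)

# Without it: the literal src=" already enforces the same condition.
without_lookahead = re.findall(r'src="(?P<src>[^"]+)', html)

# finditer() yields Match objects, so the group can be read by its name.
named = [m.group('src') for m in re.finditer(r'src="(?P<src>[^"]+)', html)]

print(with_lookahead)     # ['/pic/earth.jpg', '/pic/redrose.jpg']
print(without_lookahead)  # ['/pic/earth.jpg', '/pic/redrose.jpg']
print(named)              # ['/pic/earth.jpg', '/pic/redrose.jpg']
```

Note that this pattern matches any src="..." attribute, so on a full page it would also pick up script or iframe sources.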

Upvotes: 2
