Joe Bloggs

Reputation: 41

Issue displaying Images from website using Regex

I'm trying to scrape a website for all the images it contains. My code successfully finds every image, including .jpg, .bmp and .gif files, but the output contains the whole tag, so the height of each image is printed as well. How can I change my code so that the height is dropped and only the clean links are printed? Below are a screenshot of my current output, a screenshot of the output I am looking for, and my current code. Thanks for any help, appreciated!

My Code Output: https://i.sstatic.net/eferl.jpg

Output I am looking for: https://i.sstatic.net/RytX4.jpg

files = re.findall(r'\<img .*\=.*', page.decode())
files.sort()
print (f'\n [+] {len(files)} IMAGES FOUND:\n')
for file in files:
    print(file)

Upvotes: 0

Views: 42

Answers (2)

akash karothiya

Reputation: 5950

You can extract the image src directly by capturing it in a group:

>>> import re
>>> images = ['<img src="demo.jpg" height=12>', '<img src="demo2.jpg" height=500>']
>>> for image in images:
...     print(re.search(r'<img[^>]*src="([^"]*)"', image).group(1))
...
demo.jpg
demo2.jpg

If your input is a single string, you can use findall and then iterate over the results:

>>> images = '''<img src="demo.jpg" height=12> <img src="demo2.jpg" height=500>'''
>>> res = re.findall(r'<img[^>]*src="([^"]*)"', images)
>>> for img in res:
...     print(img)
...
demo.jpg
demo2.jpg
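
As a minimal sketch of how this could slot into the question's code, the pattern can be wrapped in a small helper. The name `extract_image_links` and the sample HTML are hypothetical, and the pattern is extended (an assumption, not part of the pattern shown above) to accept single-quoted as well as double-quoted src values:

```python
import re

# Matches an <img ...> tag and captures the quoted src value in group 2.
# Group 1 remembers which quote character opened the value.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)

def extract_image_links(html):
    """Return the sorted src values of all <img> tags found in html."""
    return sorted(match.group(2) for match in IMG_SRC.finditer(html))

# Hypothetical input standing in for page.decode() from the question.
sample = '<img src="b.gif" height=12> <img height=500 src=\'a.jpg\'>'
links = extract_image_links(sample)
print(f'\n [+] {len(links)} IMAGES FOUND:\n')
for link in links:
    print(link)
```

Because only the captured group is returned, the height attribute never reaches the output, wherever it sits inside the tag.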

Upvotes: 2

Serge Ballesta

Reputation: 148975

Regex is not the best tool for parsing HTML or XML data, and BeautifulSoup is simpler and more robust for this. You could do:

from bs4 import BeautifulSoup

...
soup = BeautifulSoup(page.decode(), 'html.parser')
# get the src attribute of every img tag that has one (skipping tags without src keeps sort() from failing on None)
files = [i["src"] for i in soup.find_all('img', src=True)]
files.sort()
print (f'\n [+] {len(files)} IMAGES FOUND:\n')
for file in files:
    print(file)

That way, the HTML is actually parsed and only real tags are returned.
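
To illustrate what "only real tags" means without installing anything, here is a rough sketch that swaps in the standard library's `html.parser` module in place of BeautifulSoup. The class name `ImgSrcCollector` and the sample markup are made up for the example. The commented-out tag is skipped by the parser, whereas a plain regex over the text would pick it up:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)

# Hypothetical input standing in for page.decode() from the question.
html = '''
<!-- <img src="commented-out.jpg"> -->
<img src="demo.jpg" height=12>
<IMG height=500 src='demo2.gif'>
'''

collector = ImgSrcCollector()
collector.feed(html)
files = sorted(collector.sources)
for file in files:
    print(file)
```

The comment is routed to the parser's comment handler and never reaches `handle_starttag`, so only the two real image links are collected.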

Upvotes: 0
