Reputation: 41
I'm currently trying to scrape a website for all the images it contains. My code successfully displays all images, including .jpg, .bmp and .gif files. However, it also displays the height of each image. How can I change my code to remove the height from the output and tidy it up so that only the clean links are printed? Below are a link showing my code's current output, a link showing my ideal output, and my current code. Thanks for any help, appreciated!
My Code Output: https://i.sstatic.net/eferl.jpg
Output I am looking for: https://i.sstatic.net/RytX4.jpg
files = re.findall(r'\<img .*\=.*', page.decode())
files.sort()
print (f'\n [+] {len(files)} IMAGES FOUND:\n')
for file in files:
    print(file)
Upvotes: 0
Views: 42
Reputation: 5950
You can extract the image src directly:
>>> import re
>>> images = ['<img src="demo.jpg" height=12>', '<img src="demo2.jpg" height=500>']
>>> for image in images:
...     print(re.search(r'<img[^>]*src="([^"]*)"', image).group(1))
...
demo.jpg
demo2.jpg
If your input is a single string, you can use findall and then iterate over the results:
>>> images = '''<img src="demo.jpg" height=12> <img src="demo2.jpg" height=500>'''
>>> res = re.findall(r'<img[^>]*src="([^"]*)"', images)
>>> for img in res:
...     print(img)
...
demo.jpg
demo2.jpg
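To tie this back to the question, here is a minimal sketch that applies the same findall idea to a whole page and returns a sorted list of links. The names IMG_SRC and image_sources are made up for illustration, and the pattern is an assumption that src may be wrapped in either single or double quotes; unquoted values are not handled.

```python
import re

# Hypothetical pattern: capture the src value of an <img> tag,
# whether it is wrapped in single or double quotes.
IMG_SRC = re.compile(r'<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)


def image_sources(html):
    """Return the sorted src values of all <img> tags found in html."""
    return sorted(match.group(2) for match in IMG_SRC.finditer(html))


sample = '<img src="b.gif" height=12> <IMG height=5 src=\'a.jpg\'>'
for src in image_sources(sample):
    print(src)
```

In the question's code, the call would presumably be image_sources(page.decode()).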
Upvotes: 2
Reputation: 148975
Regex is not exactly the best tool for parsing HTML or XML data; a real parser such as BeautifulSoup is more robust and simpler to use here. You could do:
from bs4 import BeautifulSoup
...
soup = BeautifulSoup(page.decode(), 'html.parser')
files = [img["src"] for img in soup.find_all("img", src=True)]  # src of every img tag that has one (a None entry would break the sort)
files.sort()
print (f'\n [+] {len(files)} IMAGES FOUND:\n')
for file in files:
    print(file)
That way, the HTML is effectively parsed and only real tags are returned.
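If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not the BeautifulSoup code above, and the class name ImgSrcCollector and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper that records the src of every real <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip img tags without a usable src
                self.sources.append(src)


# Made-up sample: one img inside a comment, two real ones.
sample = (
    '<!-- <img src="old.bmp"> -->'
    '<p>Use <img src="b.gif" height=12></p>'
    '<IMG SRC="a.jpg">'
)

collector = ImgSrcCollector()
collector.feed(sample)
collector.close()

files = sorted(collector.sources)
for file in files:
    print(file)
```

The commented-out old.bmp is skipped because the parser treats it as a comment rather than a tag, whereas a pattern such as `<img[^>]*src="([^"]*)"` would also report it.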
Upvotes: 0