Reputation: 79
I have a website from which I'd like to get all the images.
The website is dynamic in nature. I tried using the Agenty extension for Google Chrome and followed the steps:
This should give me the result, but it returns an empty output.
Is there a better option? Would BS4 be a better option for this? Any help is appreciated.
Upvotes: 0
Views: 8991
Reputation: 309
This site uses CSS to embed its images. If you check the source code, you can find links that start with https://images1.mcmaster.com/init/gfx/home/. Those are the actual images, but each file is several images stitched together (a row of images).
import requests
import re

url = 'https://www.mcmaster.com/'
image_urls = []
html_page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text
for values in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]', html_page):
    if str(values).startswith('http') and len(values) < 150:
        image_urls.append(values.strip())
    else:
        for elements in values.split('background-image:url('):
            for urls in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]', elements):
                urls = str(urls).split('")')[0]
                image_urls.append(urls.strip())
print(len(image_urls))
print(image_urls)
Note: Scraping a website may be subject to copyright restrictions.
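To illustrate the idea of pulling image URLs out of inline CSS without touching the network, here is a minimal sketch that runs on a made-up HTML fragment. The sample markup, the example.com URLs in it, and the helper name extract_background_urls are assumptions for illustration only; the real page may quote or format its URLs differently. It uses a non-greedy pattern and drops repeated URLs, since many elements can point at the same stitched image.

```python
import re

# Hypothetical fragment imitating inline CSS that points at stitched images.
SAMPLE_HTML = """
<div style='background-image:url("https://example.com/gfx/home/sprite-a.png?ver=1")'></div>
<div style='background-image:url("https://example.com/gfx/home/sprite-b.png?ver=2")'></div>
<div style='background-image:url("https://example.com/gfx/home/sprite-a.png?ver=1")'></div>
"""

# Non-greedy capture of whatever sits between url( and the closing ),
# allowing optional single or double quotes around the URL.
BACKGROUND_URL = re.compile(r"""background-image:\s*url\(\s*["']?(.*?)["']?\s*\)""")


def extract_background_urls(html):
    """Return the unique background-image URLs in first-seen order."""
    matches = BACKGROUND_URL.findall(html)
    # dict.fromkeys keeps insertion order while removing duplicates.
    return list(dict.fromkeys(matches))


for found in extract_background_urls(SAMPLE_HTML):
    print(found)
```

The same function could be applied to the text of a downloaded page instead of SAMPLE_HTML.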
Upvotes: 0
Reputation: 2104
You can use the Agenty Web Scraping Tool to extract the src attribute from images.
Here is a similar question answered on the Agenty forum - https://forum.agenty.com/t/can-i-extract-images-from-website/24
Full disclosure: I work at Agenty.
Upvotes: 0
Reputation: 968
I am assuming you want to download all the images on the website. This is actually very easy to do using Beautiful Soup 4 (BS4).
# Code to find all images on a given web page
from urllib.parse import urljoin
import urllib.request
import shutil

import requests
from bs4 import BeautifulSoup

url = 'https://www.mcmaster.com/'
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, features="lxml")
for img in soup.find_all('img'):
    assa = img.get('src')
    if assa:  # skip <img> tags that have no src attribute
        # urljoin handles both relative and absolute src values
        new_image = urljoin(url, assa)
You can also download the image by tacking this on to the end:
# new_image is the image URL built in the code above
response = requests.get(new_image, stream=True)
with open('Mypic.bmp', 'wb') as file:
    shutil.copyfileobj(response.raw, file)
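Two details are easy to trip over here: plain string concatenation of the page URL and a src value only works for relative paths, and a fixed output name such as Mypic.bmp is overwritten by every download. Below is a minimal sketch of one way to handle both with the standard library. The src values and the helper names resolve_src and local_name are hypothetical, not part of the code above.

```python
import posixpath
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://www.mcmaster.com/"


def resolve_src(page_url, src):
    """Turn a relative or absolute src value into a full URL."""
    return urljoin(page_url, src)


def local_name(image_url, fallback="image"):
    """Derive a file name from the URL path, ignoring any query string."""
    name = posixpath.basename(urlparse(image_url).path)
    return name or fallback


# Hypothetical src values of the kinds an <img> tag might carry.
for src in ["gfx/logo.png", "/img/banner.jpg?v=3", "https://example.com/cdn/photo.gif"]:
    full_url = resolve_src(PAGE_URL, src)
    print(full_url, "->", local_name(full_url))
```

Passing local_name(full_url) to open() instead of a fixed name would give each image its own file, although two URLs ending in the same file name would still collide.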
Everything in two lines:
from bs4 import BeautifulSoup; import urllib.request; from urllib.request import urlretrieve
for img in (BeautifulSoup((urllib.request.urlopen("https://apod.nasa.gov/apod/astropix.html")), features="lxml")).findAll('img'): assa=(img.get('src')); urlretrieve(("https://apod.nasa.gov/apod/"+assa), "Mypic.bmp")
The new image is saved in the current working directory (usually the directory you run the Python file from), but it can be moved with:
os.rename()
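As a sketch of moving a downloaded file, the example below creates a placeholder file in a temporary directory and moves it into a subfolder. It uses shutil.move rather than os.rename, because os.rename can fail when the source and destination are on different file systems. The folder name images and the helper move_into are assumptions for illustration.

```python
import os
import shutil
import tempfile


def move_into(path, target_dir):
    """Move the file at path into target_dir, creating the directory if needed."""
    os.makedirs(target_dir, exist_ok=True)
    destination = os.path.join(target_dir, os.path.basename(path))
    shutil.move(path, destination)
    return destination


# Demonstrate on a throwaway file so nothing real is touched.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "Mypic.bmp")
    with open(source, "wb") as f:
        f.write(b"placeholder bytes, not a real image")
    moved = move_into(source, os.path.join(workdir, "images"))
    print("old path exists:", os.path.exists(source))
    print("new path exists:", os.path.exists(moved))
```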
In the case of the McMaster website, the images are linked differently, so the above methods won't work. The following code should get most of the images on the website:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import urllib.request
import shutil
import requests
req = Request("https://www.mcmaster.com/")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
links = []
for link in soup.findAll('link'):
    links.append(link.get('href'))
print(links)
UPDATE: I found the code below in a GitHub post, and it is MUCH more accurate:
import requests
import re

image_link_home = "https://images1.mcmaster.com/init/gfx/home/.*[0-9]"
html_page = requests.get('https://www.mcmaster.com/', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text
for item in re.findall(image_link_home, html_page):
    if str(item).startswith('http') and len(item) < 150:
        print(item.strip())
    else:
        for elements in item.split('background-image:url('):
            for found in re.findall(image_link_home, elements):
                print((str(found).split('")')[0]).strip())
Hope this helps!
Upvotes: 1
Reputation: 1577
You should use Scrapy. It makes crawling seamless: by selecting the content you wish to download with CSS selectors, you can automate the crawling easily.
Upvotes: 0