user11409134

Reputation: 79

How can I scrape all the images from a website?

There is a website from which I'd like to get all the images.

The website is somewhat dynamic in nature. I tried using the Agenty Chrome extension and followed its setup steps.

This should yield the result, but it doesn't; it returns an empty output.

Is there any better option? Would BS4 be a better option for this? Any help is appreciated.

Upvotes: 0

Views: 8991

Answers (4)

Dhamodharan

Reputation: 309

This site uses CSS background images (sprites) to display its images. If you check the source code, you can find links containing https://images1.mcmaster.com/init/gfx/home/ ; those are the actual image files, but each one is a row of images stitched together into a single sprite.

Example : https://images1.mcmaster.com/init/gfx/home/Fastening-and-Joining-Fasteners-sprite-60.png?ver=1539608820

import requests
import re

url = 'https://www.mcmaster.com/'
image_urls = []
# Fetch the page with a browser-like User-Agent so the server returns the full HTML
html_page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text

# Find every occurrence of a sprite URL in the raw HTML
for values in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]', html_page):
    if values.startswith('http') and len(values) < 150:
        # Short match: already a single clean URL
        image_urls.append(values.strip())
    else:
        # Long match: several background-image declarations ran together; split and re-match
        for elements in values.split('background-image:url('):
            for urls in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]', elements):
                urls = urls.split('")')[0]  # drop the trailing CSS after the URL
                image_urls.append(urls.strip())

print(len(image_urls))
print(image_urls)
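
If you also want to save each sprite to disk, here is a minimal follow-up sketch. It continues from the block above (it reuses requests and the image_urls list); the sprites output folder is my own assumption:

import os
from urllib.parse import urlparse

# Assumed follow-up: download every collected sprite URL into ./sprites/
os.makedirs('sprites', exist_ok=True)
for sprite_url in image_urls:
    # Derive a filename from the URL path, ignoring the ?ver=... query string
    filename = os.path.basename(urlparse(sprite_url).path) or 'sprite.png'
    data = requests.get(sprite_url, headers={'User-Agent': 'Mozilla/5.0'}).content
    with open(os.path.join('sprites', filename), 'wb') as f:
        f.write(data)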

Note: scraping a website may be subject to copyright.

Upvotes: 0

Vikash Rathee

Reputation: 2104

You can use the Agenty Web Scraping Tool.

  1. Set up your scraper using the Chrome extension to extract the src attribute from images.
  2. Save the agent to run it in the cloud.

Here is a similar question answered on the Agenty forum: https://forum.agenty.com/t/can-i-extract-images-from-website/24
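
If you want a rough code equivalent of steps 1-2 (extracting the src attribute from every image with a CSS selector), the sketch below uses BeautifulSoup; the library and target URL are my assumptions, since the Agenty agent itself is configured through the UI:

import requests
from bs4 import BeautifulSoup

# Fetch the page and select every <img> that has a src attribute
html = requests.get('https://www.mcmaster.com/',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, 'lxml')
srcs = [img['src'] for img in soup.select('img[src]')]
print(srcs)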


Full disclosure: I work at Agenty.

Upvotes: 0

Andereoo

Reputation: 968

I am assuming you want to download all the images on the website. It is actually very easy to do this effectively using Beautiful Soup 4 (BS4).

# code to find all images in a given webpage

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import urllib.request
import requests
import shutil

url = 'https://www.mcmaster.com/'
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, features="lxml")
for img in soup.findAll('img'):
    assa = img.get('src')               # the src attribute of each <img> tag
    if assa:                            # skip <img> tags without a src
        new_image = urljoin(url, assa)  # resolve relative paths against the page URL

You can also download each image by tacking this on to the end:

response = requests.get(new_image, stream=True)
with open('Mypic.bmp', 'wb') as file:  # note: a fixed filename is overwritten on every iteration
    shutil.copyfileobj(response.raw, file)

Everything in two lines:

from bs4 import BeautifulSoup; import urllib.request; from urllib.request import urlretrieve
for img in (BeautifulSoup((urllib.request.urlopen("https://apod.nasa.gov/apod/astropix.html")), features="lxml")).findAll('img'): assa=(img.get('src')); urlretrieve(("https://apod.nasa.gov/apod/"+assa), "Mypic.bmp")

The new image should be in the same directory as the Python file, but it can be moved with:

os.rename()
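
For example, to move it into an images folder (the folder name here is just an assumption):

import os

# Move the downloaded file into an (assumed) "images" directory
os.makedirs('images', exist_ok=True)
os.rename('Mypic.bmp', os.path.join('images', 'Mypic.bmp'))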

In the case of the McMaster website, the images are linked differently (through CSS), so the methods above won't work. The following code gathers the href of every <link> element on the page instead:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request("https://www.mcmaster.com/")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

links = []

# Collect the href of every <link> element (stylesheets, icons, etc.)
for link in soup.findAll('link'):
    links.append(link.get('href'))

print(links)
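
Most of those hrefs point at stylesheets and icons rather than pictures; here is a small follow-up sketch for keeping only the entries that look like image files (the extension list is my own guess):

# Keep only hrefs that end with a common image extension (assumed list)
image_exts = ('.png', '.jpg', '.jpeg', '.gif', '.ico', '.svg')
image_links = [href for href in links
               if href and href.lower().split('?')[0].endswith(image_exts)]
print(image_links)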

UPDATE: I found the code below in a GitHub post, and it is much more accurate:

import requests
import re

image_link_home = "https://images1.mcmaster.com/init/gfx/home/.*[0-9]"
# Fetch the page with a browser-like User-Agent
html_page = requests.get('https://www.mcmaster.com/', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text

for item in re.findall(image_link_home, html_page):
    if item.startswith('http') and len(item) < 150:
        print(item.strip())  # already a single clean URL
    else:
        # Several background-image declarations ran together; split and re-match
        for elements in item.split('background-image:url('):
            for item in re.findall(image_link_home, elements):
                print(item.split('")')[0].strip())

Hope this helps!

Upvotes: 1

Benjamin Breton

Reputation: 1577

You should use Scrapy. It makes crawling seamless: by selecting the content you wish to download with CSS selectors, you can automate the crawl easily.
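
As a sketch of what that might look like (the spider name, start URL, and output field are my assumptions, not an official example):

import scrapy

class ImageSpider(scrapy.Spider):
    name = 'images'
    start_urls = ['https://www.mcmaster.com/']

    def parse(self, response):
        # 'img::attr(src)' selects the src attribute of every <img> tag
        for src in response.css('img::attr(src)').getall():
            yield {'image_url': response.urljoin(src)}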

Upvotes: 0
