user11409134

Reputation: 79

How can I scrape all the images from a website?

There is a website from which I'd like to get all the images.

The website is somewhat dynamic in nature. I tried using the Agenty Chrome extension and followed its setup steps.

This should yield the result, but it doesn't; it returns an empty output.

Is there any better option? Would BS4 be a better option for this? Any help is appreciated.

Upvotes: 0

Views: 8991

Answers (4)

Dhamodharan

Reputation: 309

This site uses CSS background images (sprites) to display its images. If you check the source code, you can find links containing https://images1.mcmaster.com/init/gfx/home/ ; those are the actual image files, but each one is a row of images stitched together into a single sprite.

Example : https://images1.mcmaster.com/init/gfx/home/Fastening-and-Joining-Fasteners-sprite-60.png?ver=1539608820

import requests
import re

url = 'https://www.mcmaster.com/'
image_urls = []
# Fetch the page with a browser-like User-Agent so the server returns the full HTML
html_page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text

# Find every occurrence of a sprite URL in the raw HTML
for values in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]', html_page):
    if values.startswith('http') and len(values) < 150:
        # Short match: already a single clean URL
        image_urls.append(values.strip())
    else:
        # Long match: several background-image declarations ran together; split and re-match
        for elements in values.split('background-image:url('):
            for urls in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]', elements):
                urls = urls.split('")')[0]  # drop the trailing CSS after the URL
                image_urls.append(urls.strip())

print(len(image_urls))
print(image_urls)
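
If you also want to save each sprite to disk, here is a minimal follow-up sketch. It continues from the block above (it reuses requests and the image_urls list); the sprites output folder is my own assumption:

import os
from urllib.parse import urlparse

# Assumed follow-up: download every collected sprite URL into ./sprites/
os.makedirs('sprites', exist_ok=True)
for sprite_url in image_urls:
    # Derive a filename from the URL path, ignoring the ?ver=... query string
    filename = os.path.basename(urlparse(sprite_url).path) or 'sprite.png'
    data = requests.get(sprite_url, headers={'User-Agent': 'Mozilla/5.0'}).content
    with open(os.path.join('sprites', filename), 'wb') as f:
        f.write(data)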

Note: scraping a website may be subject to copyright.

Upvotes: 0

Vikash Rathee

Reputation: 2104

You can use the Agenty Web Scraping Tool.

  1. Set up your scraper using the Chrome extension to extract the src attribute from images.
  2. Save the agent to run it in the cloud.

Here is a similar question answered on the Agenty forum: https://forum.agenty.com/t/can-i-extract-images-from-website/24
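
If you want a rough code equivalent of steps 1-2 (extracting the src attribute from every image with a CSS selector), the sketch below uses BeautifulSoup; the library and target URL are my assumptions, since the Agenty agent itself is configured through the UI:

import requests
from bs4 import BeautifulSoup

# Fetch the page and select every <img> that has a src attribute
html = requests.get('https://www.mcmaster.com/',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, 'lxml')
srcs = [img['src'] for img in soup.select('img[src]')]
print(srcs)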


Full disclosure: I work at Agenty.

Upvotes: 0

Andereoo

Reputation: 968

I am assuming you want to download all the images on the website. It is actually very easy to do this effectively using Beautiful Soup 4 (BS4).

# code to find all images in a given webpage

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import urllib.request
import requests
import shutil

url = 'https://www.mcmaster.com/'
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, features="lxml")
for img in soup.findAll('img'):
    assa = img.get('src')               # the src attribute of each <img> tag
    if assa:                            # skip <img> tags without a src
        new_image = urljoin(url, assa)  # resolve relative paths against the page URL

You can also download each image by tacking this on to the end:

response = requests.get(new_image, stream=True)
with open('Mypic.bmp', 'wb') as file:  # note: a fixed filename is overwritten on every iteration
    shutil.copyfileobj(response.raw, file)

Everything in two lines:

from bs4 import BeautifulSoup; import urllib.request; from urllib.request import urlretrieve
for img in (BeautifulSoup((urllib.request.urlopen("https://apod.nasa.gov/apod/astropix.html")), features="lxml")).findAll('img'): assa=(img.get('src')); urlretrieve(("https://apod.nasa.gov/apod/"+assa), "Mypic.bmp")

The new image should be in the same directory as the Python file, but it can be moved with:

os.rename()
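
For example, to move it into an images folder (the folder name here is just an assumption):

import os

# Move the downloaded file into an (assumed) "images" directory
os.makedirs('images', exist_ok=True)
os.rename('Mypic.bmp', os.path.join('images', 'Mypic.bmp'))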

In the case of the McMaster website, the images are linked differently (through CSS), so the methods above won't work. The following code gathers the href of every <link> element on the page instead:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

req = Request("https://www.mcmaster.com/")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, "lxml")

links = []

# Collect the href of every <link> element (stylesheets, icons, etc.)
for link in soup.findAll('link'):
    links.append(link.get('href'))

print(links)
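
Most of those hrefs point at stylesheets and icons rather than pictures; here is a small follow-up sketch for keeping only the entries that look like image files (the extension list is my own guess):

# Keep only hrefs that end with a common image extension (assumed list)
image_exts = ('.png', '.jpg', '.jpeg', '.gif', '.ico', '.svg')
image_links = [href for href in links
               if href and href.lower().split('?')[0].endswith(image_exts)]
print(image_links)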

UPDATE: I found the code below in a GitHub post, and it is much more accurate:

import requests
import re

image_link_home = "https://images1.mcmaster.com/init/gfx/home/.*[0-9]"
# Fetch the page with a browser-like User-Agent
html_page = requests.get('https://www.mcmaster.com/', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text

for item in re.findall(image_link_home, html_page):
    if item.startswith('http') and len(item) < 150:
        print(item.strip())  # already a single clean URL
    else:
        # Several background-image declarations ran together; split and re-match
        for elements in item.split('background-image:url('):
            for item in re.findall(image_link_home, elements):
                print(item.split('")')[0].strip())

Hope this helps!

Upvotes: 1

Benjamin Breton

Reputation: 1577

You should use Scrapy. It makes crawling seamless: by selecting the content you wish to download with CSS selectors, you can automate the crawl easily.
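
As a sketch of what that might look like (the spider name, start URL, and output field are my assumptions, not an official example):

import scrapy

class ImageSpider(scrapy.Spider):
    name = 'images'
    start_urls = ['https://www.mcmaster.com/']

    def parse(self, response):
        # 'img::attr(src)' selects the src attribute of every <img> tag
        for src in response.css('img::attr(src)').getall():
            yield {'image_url': response.urljoin(src)}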

Upvotes: 0
