Hypernova

Reputation: 103

Why can I only scrape 16 photos from pixabay?

I need backlight image data, so I'm trying to download backlight images from Pixabay. However, the following code downloads only 16 images.

I compared the HTML source to find out why, and I noticed a difference. The images that download successfully have a srcset attribute on the img tag, and my code downloads the first picture in the srcset. The other images only have a src attribute, and my code can't download them. Does anyone know what the problem is?

Code

from bs4 import BeautifulSoup
import urllib.request
import os.path
url="https://pixabay.com/images/search/backlight/"
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
source = response.read()
soup = BeautifulSoup(source, "html.parser")
img = soup.find_all("img")
cnt = 0
for image in img:
    img_src=image.get("src")
    if img_src[0]=='/':
        continue
    cnt += 1
    print(img_src)
    path = "C:/Users/Guest001/Test/" + str(cnt) + ".jpg"
    print(path)
    urllib.request.urlretrieve(img_src, path)

Upvotes: 0

Views: 571

Answers (1)

Kostas Charitidis

Reputation: 3113

Some of the images have /static/img/blank.gif as their src, and the real URL is in the data-lazy attribute. Also, some of the images have a .png suffix. Here is a working example.

from bs4 import BeautifulSoup
import urllib.request
import os.path
url="https://pixabay.com/images/search/backlight/"
# Send a browser-like User-Agent with every request
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
source = response.read()
soup = BeautifulSoup(source, "html.parser")
img = soup.find_all("img")
cnt = 0
for image in img:
    src = image.get("src") or ""
    # Lazy-loaded images have a placeholder .gif in src; the real URL is in data-lazy
    img_src = src if '.gif' not in src else image.get('data-lazy')
    # Skip tags without a usable URL and site-relative assets
    if not img_src or img_src[0]=='/':
        continue
    # Pick the file extension from the URL; skip anything that is not .jpg or .png
    if '.jpg' in img_src:
        ext = ".jpg"
    elif '.png' in img_src:
        ext = ".png"
    else:
        continue
    cnt += 1
    print(img_src)
    path = "C:/Users/Guest001/Test/" + str(cnt) + ext
    print(path)
    urllib.request.urlretrieve(img_src, path)
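
If you want to keep the URL logic separate from the download loop, something like the following might help. It is only a sketch: the helper names resolve_image_url and file_extension are made up for illustration, and it assumes the placeholder path and the data-lazy attribute are as described above (the page markup may change). It uses only the standard library and accepts anything with a .get() method, so a BeautifulSoup tag or a plain dict both work.

```python
import os.path
from urllib.parse import urlparse

# Placeholder src mentioned above (assumption: the site still uses this path)
PLACEHOLDER = "/static/img/blank.gif"


def resolve_image_url(attrs):
    """Return the real image URL for an <img> tag's attributes, or None.

    `attrs` is any mapping with .get(), e.g. a BeautifulSoup tag or a dict.
    """
    src = attrs.get("src") or ""
    if src and not src.endswith(PLACEHOLDER):
        candidate = src
    else:
        # Lazy-loaded image: fall back to the data-lazy attribute
        candidate = attrs.get("data-lazy") or ""
    # Keep only absolute http(s) URLs; site-relative paths are skipped
    if candidate.startswith(("http://", "https://")):
        return candidate
    return None


def file_extension(url, default=".jpg"):
    """Take the extension from the URL path, ignoring any query string."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return ext if ext else default
```

Inside the loop you would then call resolve_image_url(image), skip the tag when it returns None, and build the file name with str(cnt) + file_extension(img_src).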

Upvotes: 2
