Moaz Nasem

Reputation: 3

How do I load the full web page before downloading with requests.get()?

My program asks the user for a keyword, downloads all the matching images from https://www.pexels.com/, and stores them in a folder on the hard drive.
The problem is that it downloads only the first 30 pictures that appear when the page loads; it doesn't take into account that more images load as I scroll down.

I want my program to "scroll down" the page and download all the images. Here is my code:

#! /usr/bin/python3 
import os, requests, bs4
keyword = input('Enter one-word search keyword: ')
url = 'https://www.pexels.com/search/' + keyword
res = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36"})
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
tagObj = soup.select('.photo-item__img')

if tagObj == []:
    print('Sorry, no pictures found!')
else:
    print(len(tagObj))
    os.makedirs(str(keyword), exist_ok=True)
    for i in range(len(tagObj)):
        imgUrl = tagObj[i].get('srcset')
        print('Downloading img %s' %imgUrl)
        res = requests.get(imgUrl, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36"})
        res.raise_for_status()
        # open img file for binary writing.
        imgFile = open(os.path.join(str(keyword), os.path.basename(imgUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imgFile.write(chunk)
        imgFile.close()
    print('Done.')

Upvotes: 0

Views: 858

Answers (1)

Kamal

Reputation: 2554

The web page loads only the first 30 results and fetches more with an XHR request when you scroll. Using the browser's devtools, I found the actual XHR request and used it to get all the data.

Sample URL for the XHR GET request:

https://www.pexels.com/search/iphone%20x/?format=js&seed=2019-03-19%2B06%3A31%3A20%2B%2B0000&page=2
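To make the shape of that URL explicit, here is a minimal sketch that rebuilds it from its parts with the standard library. The function name `build_search_url` is hypothetical, and the assumption that the site accepts any seed in this format comes only from the sample above.

```python
from urllib.parse import quote, urlencode


def build_search_url(keyword, page, seed):
    """Return a paginated search URL shaped like the sample XHR request."""
    # quote() percent-encodes a space as %20, matching "iphone%20x".
    path = 'https://www.pexels.com/search/{}/'.format(quote(keyword, safe=''))
    # quote_via=quote encodes '+' as %2B and ':' as %3A, as in the sample seed.
    query = urlencode({'format': 'js', 'seed': seed, 'page': page}, quote_via=quote)
    return path + '?' + query


print(build_search_url('iphone x', 2, '2019-03-19+06:31:20++0000'))
```

Encoding the keyword this way also keeps multi-word searches from producing a malformed URL.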

Sample response from the request:

;(function() {
  var infiniteScrollingAppender = window.Pexels.PhotoGrid.infiniteScrollingAppender({
    currentPage: 26,
    totalPages: 26,
    paginationHtml: '<div class=\"pagination\"><a class=\"previous_page\" rel=\"prev\" href=\"/search/iphone%20x/?page=25&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">Previous<\/a> <a href=\"/search/iphone%20x/?page=1&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">1<\/a> <a href=\"/search/iphone%20x/?page=2&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">2<\/a> <span class=\"gap\">&hellip;<\/span> <a href=\"/search/iphone%20x/?page=18&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">18<\/a> <a href=\"/search/iphone%20x/?page=19&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">19<\/a> <a href=\"/search/iphone%20x/?page=20&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">20<\/a> <a href=\"/search/iphone%20x/?page=21&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">21<\/a> <a href=\"/search/iphone%20x/?page=22&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">22<\/a> <a href=\"/search/iphone%20x/?page=23&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">23<\/a> <a href=\"/search/iphone%20x/?page=24&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">24<\/a> <a rel=\"prev\" href=\"/search/iphone%20x/?page=25&amp;seed=2019-03-19%2B06%3A31%3A20%2B%2B0000\">25<\/a> <em class=\"current\">26<\/em> <span class=\"next_page disabled\">Next<\/span><\/div>',
    inlineSponsoredPhotosUrl: '/sponsored_photos/8/inline/?query=iphone+x'
  });

  infiniteScrollingAppender.append('<div class=\'hide-featured-badge  hide-favorite-badge\'>\n<article class=\'photo-item photo-item--overlay\' data-aspect-ratio=\'1.5\' data-meta-title=\'Person Holding Silver Iphone 5s · Free Stock Photo\' data-photo-modal-aspect-ratio=\'1.5\' data-photo-modal-can-accept-donations data-photo-modal-download-text-large=\'&lt;strong&gt;Large&lt;/strong&gt; (1920 x 1280)\' data-photo-modal-download-text-medium=\'&lt;strong&gt;Medium&lt;/strong&gt; (1280 x 853)\' data-photo-modal-download-text-original=\'&lt;strong&gt;Original&lt;/strong&gt; (3504 x 2336)\' data-photo-modal-download-text-small=\'&lt;strong&gt;Small&lt;/strong&gt; (640 x 426)\' data-photo-modal-download-url=\'https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?cs=srgb&amp;dl=adult-hairy-hand-986835.jpg&amp;fm=jpg\' data-photo-modal-download-value-large=\'1920x1280\' data-photo-modal-download-value-medium=\'1280x853\' data-photo-modal-download-value-original=\'3504x2336\' data-photo-modal-download-value-small=\'640x426\' data-photo-modal-height=\'2336\' data-photo-modal-image-alt=\'Person Holding Silver Iphone 5s\' data-photo-modal-image-details-description=\'\' data-photo-modal-image-details-license=\'Free to use\' data-photo-modal-image-details-license-link=\'/photo-license/\' data-photo-modal-image-download-link=\'https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?cs=srgb&amp;dl=adult-hairy-hand-986835.jpg&amp;fm=jpg\' data-photo-modal-image-grid-item-src=\'https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=1&amp;w=500\' data-photo-modal-image-grid-item-srcset=\'https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=1&amp;w=500 1x, https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=2&amp;w=500 2x\' 
data-photo-modal-image-portrait=\'https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;fit=crop&amp;h=1200&amp;w=800\' data-photo-modal-image-src=\'https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;h=750&amp;w=1260\' data-photo-modal-image-srcset=\'https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;h=650&amp;w=940 940w, https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;h=750&amp;w=1260 1260w, https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=2&amp;h=650&amp;w=940 1880w, https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=2&amp;h=750&amp;w=1260 2520w\' data-photo-modal-image-style=\'background: rgb(90, 108, 29);max-height: 75vh;max-width: calc((3504 / 2336) * 75vh);min-height: 300px;min-width: calc((3504 / 2336) * 300px);\' data-photo-modal-image-zoom-src=\'https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=3&amp;h=750&amp;w=1260\' data-photo-modal-medium-id=\'986835\' data-photo-modal-photographer-id=\'365778\' data-photo-modal-type=\'Photo\' data-photo-modal-user-profile-avatar-src=\'https://images.pexels.com/users/avatars/365778/nick-demou-221.jpeg?w=256&amp;h=256&amp;fit=crop&amp;crop=faces\' data-photo-modal-user-profile-donation-link=\'/photo/person-holding-silver-iphone-5s-986835/donate/\' data-photo-modal-user-profile-full-name=\'Nick Demou\' data-photo-modal-user-profile-link=\'/@nick-demou-365778\' data-photo-modal-user-profile-location=\'Stoke-on-Trent, UK\' data-photo-modal-video-style=\'background: white;display: none;\' data-photo-modal-width=\'3504\' style=\'padding-top: 66.66666666666666%\'>\n<a class=\"js-photo-link photo-item__link\" style=\"background: rgb(90,108,29)\" title=\"Person Holding Silver Iphone 
5s\" href=\"/photo/person-holding-silver-iphone-5s-986835/\"><img srcset=\"https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=1&amp;w=500 1x, https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=2&amp;w=500 2x\" class=\"photo-item__img\" alt=\"Person Holding Silver Iphone 5s\" data-image-width=\"3504\" data-image-height=\"2336\" data-big-src=\"https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;h=750&amp;w=1260\" data-large-src=\"https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;h=650&amp;w=940\" data-tiny-src=\"https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=1&amp;w=500\" data-tiny-srcset=\"https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=1&amp;w=500 1x, https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=2&amp;w=500 2x\" data-pin-media=\"https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;fit=crop&amp;h=1200&amp;w=800\" src=\"https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?auto=compress&amp;cs=tinysrgb&amp;dpr=1&amp;w=500\" />\n<div class=\'badge-container\'>\n<span class=\'favorite-badge\' data-tooltip=\'This photo was uploaded by one of the photographers you follow.\' data-tooltip-align=\'left\'>\n<img height=\"14\" width=\"14\" class=\"favorite-badge__icon\" src=\"/assets/favorite-f721c3d387889d5c3a9e0943c1836840a2954b9bebab846ca963877afee48f21.svg\" />\n<\/span>\n\n<span class=\"featured-badge\" data-tooltip=\"This photo was featured on the home page and can be found through the search.\" data-tooltip-align=\"left\">\n  <img height=\"14\" width=\"14\" class=\"featured-badge__icon\" 
src=\"/assets/star-1bf7ee8c305832829a0a1e0b5c5d901e34e6732cd67c90715cd9b554a785877b.svg\" />\n<\/span>\n\n<\/div>\n\n<\/a><a class=\"photo-item__photographer\" href=\"/@nick-demou-365778\"><img class=\"photo-item__avatar\" height=\"30\" width=\"30\" src=\"https://images.pexels.com/users/avatars/365778/nick-demou-221.jpeg?w=60&amp;h=60&amp;fit=crop&amp;crop=faces\" />\n<span class=\'photo-item__name\'>Nick Demou<\/span>\n<\/a><a download=\"true\" href=\"https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg?cs=srgb&amp;dl=adult-hairy-hand-986835.jpg&amp;fm=jpg\"><\/a>\n<div class=\'photo-item__info\'>\n<button class=\'js-like js-like-986835 rd__button rd__button--like rd__button--no-padding rd__button--text-white rd__button--with-icon\' data-photo-id=\'986835\'>\n<i class=\'rd__button--like--not-active--icon rd__svg-icon\'><svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path d=\"M16.5 3c-1.74 0-3.41.81-4.5 2.09C10.91 3.81 9.24 3 7.5 3 4.42 3 2 5.42 2 8.5c0 3.78 3.4 6.86 8.55 11.54L12 21.35l1.45-1.32C18.6 15.36 22 12.28 22 8.5 22 5.42 19.58 3 16.5 3zm-4.4 15.55l-.1.1-.1-.1C7.14 14.24 4 11.39 4 8.5 4 6.5 5.5 5 7.5 5c1.54 0 3.04.99 3.57 2.36h1.87C13.46 5.99 14.96 5 16.5 5c2 0 3.5 1.5 3.5 3.5 0 2.89-3.14 5.74-7.9 10.05z\"><\/path><\/svg>\n<\/i>\n<i class=\'rd__button--like--active--icon rd__svg-icon\' style=\'display: none\'><svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path d=\"M12 21.35l-1.45-1.32C5.4 15.36 2 12.28 2 8.5 2 5.42 4.42 3 7.5 3c1.74 0 3.41.81 4.5 2.09C13.09 3.81 14.76 3 16.5 3 19.58 3 22 5.42 22 8.5c0 3.78-3.4 6.86-8.55 11.54L12 21.35z\"><\/path><\/svg>\n<\/i>\n<\/button>\n<button class=\'js-collect js-collect-986835 rd__button rd__button--collect rd__button--no-padding rd__button--text-white rd__button--with-icon\' data-photo-id=\'986835\'>\n<i class=\'rd__button--collect--not-active--icon rd__svg-icon\'><svg xmlns=\"http://www.w3.org/2000/svg\" 
width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path d=\"M13 7h-2v4H7v2h4v4h2v-4h4v-2h-4V7zm-1-5C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm0 18c-4.41 0-8-3.59-8-8s3.59-8 8-8 8 3.59 8 8-3.59 8-8 8z\"><\/path><\/svg>\n<\/i>\n<i class=\'rd__button--collect--active--icon rd__svg-icon\' style=\'display: none\'><svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\"><path d=\"M12 2C6.48 2 2 6.48 2 12s4.48 10 10 10 10-4.48 10-10S17.52 2 12 2zm-2 15l-5-5 1.41-1.41L10 14.17l7.59-7.59L19 8l-9 9z\"><\/path><\/svg>\n<\/i>\n<\/button>\n<\/div>\n\n<\/article>\n\n<\/div>\n', 0);    
  infiniteScrollingAppender.execute()
})();
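The response is JavaScript that wraps escaped HTML, not HTML itself. As an illustration only, here is a minimal sketch that pulls out `totalPages` and the HTML fragment passed to each `append(...)` call using regular expressions. The names `parse_js_response`, `unescape_js`, and `SAMPLE` are hypothetical, `SAMPLE` is a shortened stand-in for the real response, and the list of escapes handled is an assumption based on the sample above.

```python
import re

# Matches the single-quoted JS string passed to each append(...) call,
# allowing backslash-escaped characters inside it.
APPEND_CALL = re.compile(
    r"infiniteScrollingAppender\.append\('((?:[^'\\]|\\.)*)'", re.DOTALL
)
TOTAL_PAGES = re.compile(r"totalPages:\s*(\d+)")
# Assumption: only these escapes map to something other than the bare character.
ESCAPES = {"n": "\n", "t": "\t"}


def unescape_js(text):
    """Undo JavaScript string escapes such as backslash-quote and backslash-n."""
    return re.sub(
        r"\\(.)",
        lambda m: ESCAPES.get(m.group(1), m.group(1)),
        text,
        flags=re.DOTALL,
    )


def parse_js_response(text):
    """Return (total_pages, list_of_html_fragments) from the JS response text."""
    match = TOTAL_PAGES.search(text)
    total_pages = int(match.group(1)) if match else 0
    fragments = [unescape_js(raw) for raw in APPEND_CALL.findall(text)]
    return total_pages, fragments


# Shortened, made-up response in the same shape as the real one.
SAMPLE = r"""
;(function() {
  var infiniteScrollingAppender = window.Pexels.PhotoGrid.infiniteScrollingAppender({
    currentPage: 2,
    totalPages: 26,
  });
  infiniteScrollingAppender.append('<div class=\'a\'>\n<img class=\"photo-item__img\" srcset=\"u1 1x, u2 2x\" />\n<\/div>\n', 0);
  infiniteScrollingAppender.execute()
})();
"""

total_pages, fragments = parse_js_response(SAMPLE)
print(total_pages, len(fragments))
```

Each returned fragment is plain HTML, so it can be handed to BeautifulSoup and queried with `.photo-item__img` as in the question's code.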

You can parse the response however you like to find the required data. The following code collects the same "srcset" data for all the images as your code does, using BeautifulSoup. (Note: the full response cannot be turned into soup directly because it is not valid HTML.) You can merge your downloading code with it.

import datetime, re, requests
from bs4 import BeautifulSoup

# Same prompt as in the question's code; this snippet needs `keyword` defined.
keyword = input('Enter one-word search keyword: ')
# Build a seed in the same URL-encoded format as the sample URL above.
seed = datetime.datetime.now().strftime('%Y-%m-%d%%2B%H%%3A%M%%3A%S%%2B%%2B0000')
url = 'https://www.pexels.com/search/{}/?format=js&seed={}&page='.format(keyword, seed)
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36'}
res = requests.get(url + '1', headers=headers)
res.raise_for_status()
# Extract total number of pages of results from response like "totalPages: 26,"
match = re.search(r'totalPages:\s*(\d+)', res.text)
pages = int(match.group(1)) if match else 0
imgurls = []

if not pages:
    print('Sorry, no pictures found!')
else:
    for page in range(1, pages+1):
        # Each new search result follows this string, so split the response text on it.
        imgs = res.text.split('infiniteScrollingAppender.append')[1:]
        for img in imgs:
            # The response escapes single and double quotes with backslashes; undo that to get valid HTML.
            soup = BeautifulSoup(img[2:-5].replace("\\'", "'").replace('\\"', '"'), 'html.parser')
            imgurls.append(soup.select('.photo-item__img')[0].get('srcset'))
        if page < pages:
            # Fetch the next page of results for the next iteration.
            res = requests.get(url + str(page+1), headers=headers)
            res.raise_for_status()
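
When merging in the download code, note that each collected "srcset" value lists several candidates (a URL followed by a descriptor such as 1x or 2x), so it probably should not be passed to requests.get() as is. Here is a minimal sketch of picking one URL and a clean file name from it. The helper names are hypothetical, and the simple comma split assumes the URLs themselves contain no commas, which holds for the sample response above.

```python
import posixpath
from urllib.parse import urlsplit


def first_url_from_srcset(srcset):
    """Return the first candidate URL in a srcset value, or None if empty."""
    # Candidates are comma-separated; each is "URL descriptor", e.g. "... 1x".
    parts = srcset.split(',')[0].split()
    return parts[0] if parts else None


def filename_from_url(url):
    """Return the last path segment of the URL, without the query string."""
    return posixpath.basename(urlsplit(url).path)


srcset = ('https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg'
          '?auto=compress&cs=tinysrgb&dpr=1&w=500 1x, '
          'https://images.pexels.com/photos/986835/pexels-photo-986835.jpeg'
          '?auto=compress&cs=tinysrgb&dpr=2&w=500 2x')
img_url = first_url_from_srcset(srcset)
print(img_url)
print(filename_from_url(img_url))
```

The resulting URL can go to requests.get(), and the file name can replace os.path.basename(imgUrl) in the question's code so that the query string does not end up in the file name.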

Let me know if you face any issue.

UPDATE:

You can find such XHR requests yourself in the future using the browser's devtools. In this case, open devtools in Chrome, go to the 'Network' tab, filter to show only XHR requests, and then scroll to load more results. It will show a request like the sample above.


Upvotes: 1
