MITHU

Reputation: 154

Can't get a requests-based script to produce all the image links from a webpage

I'm trying to grab all the images from this webpage using requests. When I run the script I've written so far, it doesn't return anything at all. Although the images are available in the page source, I can't get the script to work. I want to scrape all the images that show up while scrolling to the bottom. I also noticed in the dev tools that a link, https://www.pexels.com/sv-se/sok/office/?format=js&seed=&page=4&type=, generates all the content as the page number attached to it is incremented. However, I failed to get the images using that link as well.

What I've written so far:

import requests
from bs4 import BeautifulSoup

url = 'https://www.pexels.com/sv-se/sok/office/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'
    s.headers['referer'] = 'https://www.pexels.com/sv-se/'
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("a.photo-item__link > img.photo-item__img"):
        print(item['data-large-src'])

How can I grab all the image links from that webpage using requests?

Upvotes: 3

Views: 180

Answers (1)

Andrej Kesely

Reputation: 195438

You can try this script to get all image links from the URL:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.pexels.com/sv-se/sok/office/?format=js&seed=&page={page}&type='

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0',
           'Referer': 'https://www.pexels.com/sv-se/sok/office/',
           'X-Requested-With': 'XMLHttpRequest',
           'Accept-Language': 'en-US,en;q=0.5'}
cookies = {'locale': 'sv-SE'}

page = 1
picture_num = 1
while True:
    data = requests.get(url.format(page=page), headers=headers, cookies=cookies).text
    # The JS response reports how many result pages exist
    total_pages = int(re.search(r'"totalPages"\s*:\s*(\d+)', data).group(1))
    # Each infiniteScrollingAppender.append('...') call carries one escaped HTML fragment
    imgs = re.findall(r"infiniteScrollingAppender\.append\('(.*?)',\s*'", data)

    if page > total_pages:
        break

    for d in imgs:
        # Undo the JavaScript string escaping so the fragment parses as HTML
        d = d.replace(r'\'', "'").replace(r'\"', '"').replace(r'\/', "/").replace(r'\n', '\n')
        print('{}/{} picture_num={}'.format(page, total_pages, picture_num), BeautifulSoup(d, 'html.parser').select_one('[data-large-src]')['data-large-src'])
        picture_num += 1

    page += 1

Prints:

1/204 picture_num=1 https://images.pexels.com/photos/2041627/pexels-photo-2041627.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=2 https://images.pexels.com/photos/3987020/pexels-photo-3987020.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=3 https://images.pexels.com/photos/3810754/pexels-photo-3810754.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=4 https://images.pexels.com/photos/3178818/pexels-photo-3178818.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=5 https://images.pexels.com/photos/3861958/pexels-photo-3861958.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=6 https://images.pexels.com/photos/3862365/pexels-photo-3862365.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=7 https://images.pexels.com/photos/3746932/pexels-photo-3746932.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=8 https://images.pexels.com/photos/3277806/pexels-photo-3277806.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=9 https://images.pexels.com/photos/1957477/pexels-photo-1957477.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=10 https://images.pexels.com/photos/3184296/pexels-photo-3184296.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=11 https://images.pexels.com/photos/3184357/pexels-photo-3184357.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=12 https://images.pexels.com/photos/4064641/pexels-photo-4064641.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=13 https://images.pexels.com/photos/2041629/pexels-photo-2041629.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=14 https://images.pexels.com/photos/3184359/pexels-photo-3184359.jpeg?auto=compress&cs=tinysrgb&h=650&w=940


...and so on.
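
To see what the loop does to a single response without hitting the site, here is a minimal offline sketch of the extraction step. Note the assumptions: `SAMPLE_JS` is a made-up string shaped roughly like the `?format=js` payload (the real one may differ), `images.example` is a placeholder host, and `extract_large_srcs` and `LargeSrcCollector` are hypothetical names. It also swaps BeautifulSoup for the standard-library `html.parser` so it runs with no third-party packages.

```python
import re
from html.parser import HTMLParser

# Made-up sample resembling the JS payload; the real response may look different.
SAMPLE_JS = r"""
infiniteScrollingAppender.append('<a href=\"\/photo\/1\/\"><img data-large-src=\"https:\/\/images.example\/1.jpeg\"><\/a>', 'x');
infiniteScrollingAppender.append('<a href=\"\/photo\/2\/\"><img data-large-src=\"https:\/\/images.example\/2.jpeg\"><\/a>', 'x');
"""

# Same pattern as in the script above: capture the first argument of each append() call.
APPEND_RE = re.compile(r"infiniteScrollingAppender\.append\('(.*?)',\s*'")


class LargeSrcCollector(HTMLParser):
    """Collects the value of every data-large-src attribute it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-large-src" and value:
                self.links.append(value)


def extract_large_srcs(js_text):
    """Return the data-large-src URLs found in the escaped HTML fragments."""
    links = []
    for fragment in APPEND_RE.findall(js_text):
        # Undo the JavaScript string escaping before parsing the fragment as HTML.
        html = (fragment.replace("\\'", "'")
                        .replace('\\"', '"')
                        .replace("\\/", "/")
                        .replace("\\n", "\n"))
        collector = LargeSrcCollector()
        collector.feed(html)
        collector.close()
        links.extend(collector.links)
    return links


if __name__ == "__main__":
    for link in extract_large_srcs(SAMPLE_JS):
        print(link)
```

In the real script you would pass the text returned by `requests.get(...)` for each page to such a function instead of `SAMPLE_JS`.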

Upvotes: 1
