Reputation: 154
I'm trying to grab all the images from this webpage using requests. The script I've created so far doesn't fetch anything at all, even though the images are present in the page source. I want to scrape all the images that show up while scrolling to the bottom. I also noticed a link, https://www.pexels.com/sv-se/sok/office/?format=js&seed=&page=4&type=,
in the dev tools that generates all the content as the attached page number is incremented, but I failed to produce any images using that link as well.
Here is what I've written so far:
import requests
from bs4 import BeautifulSoup

url = 'https://www.pexels.com/sv-se/sok/office/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'
    s.headers['referer'] = 'https://www.pexels.com/sv-se/'
    r = s.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("a.photo-item__link > img.photo-item__img"):
        print(item['data-large-src'])
How can I grab all the image links from that webpage using requests?
Upvotes: 3
Views: 180
Reputation: 195438
You can try this script to get all image links from the URL:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.pexels.com/sv-se/sok/office/?format=js&seed=&page={page}&type='

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0',
    'Referer': 'https://www.pexels.com/sv-se/sok/office/',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept-Language': 'en-US,en;q=0.5',
}
cookies = {'locale': 'sv-SE'}

page = 1
picture_num = 1
while True:
    data = requests.get(url.format(page=page), headers=headers, cookies=cookies).text

    # The response is JavaScript; pull the page count and the escaped HTML
    # fragments passed to infiniteScrollingAppender.append(...) out of it.
    total_pages = int(re.search(r'"totalPages"\s*:\s*(\d+)', data).group(1))
    imgs = re.findall(r"infiniteScrollingAppender\.append\('(.*?)',\s*'", data)

    if page > total_pages:
        break

    for d in imgs:
        # Un-escape the JavaScript string literal back into plain HTML,
        # then read the data-large-src attribute from it.
        d = d.replace(r'\'', "'").replace(r'\"', '"').replace(r'\/', '/').replace(r'\n', '\n')
        print('{}/{} picture_num={}'.format(page, total_pages, picture_num),
              BeautifulSoup(d, 'html.parser').select_one('[data-large-src]')['data-large-src'])
        picture_num += 1

    page += 1
Prints:
1/204 picture_num=1 https://images.pexels.com/photos/2041627/pexels-photo-2041627.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=2 https://images.pexels.com/photos/3987020/pexels-photo-3987020.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=3 https://images.pexels.com/photos/3810754/pexels-photo-3810754.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=4 https://images.pexels.com/photos/3178818/pexels-photo-3178818.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=5 https://images.pexels.com/photos/3861958/pexels-photo-3861958.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=6 https://images.pexels.com/photos/3862365/pexels-photo-3862365.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=7 https://images.pexels.com/photos/3746932/pexels-photo-3746932.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=8 https://images.pexels.com/photos/3277806/pexels-photo-3277806.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=9 https://images.pexels.com/photos/1957477/pexels-photo-1957477.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=10 https://images.pexels.com/photos/3184296/pexels-photo-3184296.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=11 https://images.pexels.com/photos/3184357/pexels-photo-3184357.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=12 https://images.pexels.com/photos/4064641/pexels-photo-4064641.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=13 https://images.pexels.com/photos/2041629/pexels-photo-2041629.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
1/204 picture_num=14 https://images.pexels.com/photos/3184359/pexels-photo-3184359.jpeg?auto=compress&cs=tinysrgb&h=650&w=940
...and so on.
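If you want to save the pictures rather than just print the links, the URLs the script yields can be downloaded with requests as well. A minimal sketch (the function names and output directory here are my own choices, not part of the original script):

```python
import os
import requests


def image_filename(index, out_dir="pexels_office"):
    """Build a local path like pexels_office/photo_1.jpeg for the index-th image."""
    return os.path.join(out_dir, "photo_{}.jpeg".format(index))


def download_images(links, out_dir="pexels_office"):
    """Download every image URL in links into out_dir, numbering the files."""
    os.makedirs(out_dir, exist_ok=True)
    for i, link in enumerate(links, 1):
        resp = requests.get(link, timeout=30)
        resp.raise_for_status()  # fail loudly on HTTP errors
        with open(image_filename(i, out_dir), "wb") as f:
            f.write(resp.content)
```

Instead of printing inside the loop, collect the `data-large-src` values into a list and pass it to `download_images`.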
Upvotes: 1