james joyce

Reputation: 493

Scrape data from lazy Loading Page

I am trying to scrape the data from this webpage, and I can successfully scrape the data I need.
The problem is that the page downloaded with requests contains only 45 product details, while the live webpage actually has more than 4000 products. This happens because the remaining data is loaded only as you scroll down the page.
I would like to scrape all the products available on the page.

CODE

import requests
from bs4 import BeautifulSoup
import json
import re

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

base_url = "link that i provided"
r = requests.get(base_url,headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

# The page state is embedded in the 12th <script> tag as "var something = {...};"
scripts = soup.find_all('script')[11].text
script = scripts.split('=', 1)[1]  # drop everything up to the first '='
script = script.rstrip()           # strip trailing whitespace
script = script[:-1]               # strip the trailing ';'

data = json.loads(script)

skus = list(data['grid']['entities'].keys())

prodpage = []
for sku in skus:
    prodpage.append('https://www.ajio.com{}'.format(data['grid']['entities'][sku]['url']))

print(len(prodpage))
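As an aside, indexing `soup.find_all('script')[11]` is fragile, since the script order can change whenever the page layout does. A more robust approach is to select the script by its content. A minimal sketch below, operating on inline sample HTML; the variable name `window.__INITIAL_STATE__` is an assumption, so check the actual page source for the real assignment:

```python
import json
import re

# Sample HTML standing in for the downloaded page. The variable name
# "window.__INITIAL_STATE__" is an assumption -- inspect the real page
# source to find the actual assignment the site uses.
html = ('<script>var other = 1;</script>'
        '<script>window.__INITIAL_STATE__='
        '{"grid": {"entities": {"sku1": {"url": "/p/sku1"}}}};</script>')

# Match the script by its content instead of relying on a positional index.
m = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.*?\});?\s*</script>', html, re.S)
data = json.loads(m.group(1))
print(list(data['grid']['entities'].keys()))
```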

Upvotes: 1

Views: 6987

Answers (1)

Ahmed Soliman

Reputation: 1710

Loading on scroll means the data is being generated by JavaScript, so you have more than one option here. The first is to use Selenium; the second is to send the same Ajax request the website itself uses, as follows:

def get_source(page_num=1):
    url = 'https://www.ajio.com/api/category/830216001?fields=SITE&currentPage={}&pageSize=45&format=json&query=%3Arelevance%3Abrickpattern%3AWashed&sortBy=relevance&gridColumns=3&facets=brickpattern%3AWashed&advfilter=true'

    res = requests.get(url.format(page_num), headers={'User-Agent': 'Mozilla/5.0'})
    if res.status_code == 200:
        return res.json()

# data = get_source(page_num=1)
# total_pages = data['pagination']['totalPages'] # total pages are 111
prodpage = []
for i in range(1, 112):
    print(f'Getting page {i}')
    data = get_source(page_num=i)['products']
    for item in data:
        prodpage.append('https://www.ajio.com{}'.format(item['url']))
    if i == 3: break
print(len(prodpage)) # output 135 for 3 pages
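The commented-out `totalPages` lines point at a cleaner approach than hard-coding `range(1, 112)`: read the page count from the first response and build the request URLs from it. A small sketch of that step, with the page count stubbed to the 111 pages mentioned above rather than fetched live (the query string is abbreviated for readability; keep the full parameter set in real use):

```python
# Template for the category API URL (abbreviated query string; an assumption
# that the remaining parameters can be appended unchanged).
API = 'https://www.ajio.com/api/category/830216001?fields=SITE&currentPage={}&pageSize=45&format=json'

def page_urls(total_pages):
    """One request URL per page; the API's currentPage is 1-indexed here."""
    return [API.format(n) for n in range(1, total_pages + 1)]

# In real use: total_pages = get_source(1)['pagination']['totalPages']
urls = page_urls(111)
print(len(urls))
```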

Upvotes: 4
