Anant Gupta

Reputation: 53

Scraping content from infinite scroll website

I am trying to scrape the links from a webpage with infinite scrolling. I am only able to fetch the links on the first pane. How do I proceed so as to build a complete list of all the links? Here is what I have so far:


from bs4 import BeautifulSoup
import requests

html = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&car=7&pn=8&lcr=168&ldr=0&lir=0"
html_content = requests.get(html).text
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all("div", {"class": "card-detail-block__data"})

y = []
for i in table:
    try:
        y.append(i.find("a", {"id":"linkToDetails"}).get('href'))
    except AttributeError:
        pass

z = [('carwale.com' + item) for item in y]
z

Upvotes: 3

Views: 545

Answers (2)

Ikram Khan Niazi

Reputation: 805

Try this

next_page = soup.find('a', rel='next', href=True)

if next_page:
    # next_page['href'] may be relative; prefix the site root if needed
    next_html_content = requests.get(next_page['href']).text

The next-page URL is hidden in the site source. You can find it by searching for the rel="next" attribute in the page source in your browser.
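
For completeness, here is a minimal sketch of how that lookup could be repeated in a loop to collect the detail links from every page. It reuses the selectors from the question; whether this listing page actually renders a rel="next" anchor is an assumption, so verify it in the page source first.

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

start_url = "https://www.carwale.com/used/cars-for-sale/"  # assumption: first listing page
links = []
url = start_url

while url:
    soup = BeautifulSoup(requests.get(url).text, "lxml")

    # Collect the detail links on the current page (same selectors as in the question)
    for card in soup.find_all("div", {"class": "card-detail-block__data"}):
        a = card.find("a", {"id": "linkToDetails"})
        if a and a.get("href"):
            links.append(urljoin(url, a["href"]))

    # Follow the rel="next" link if the page exposes one; otherwise stop
    next_page = soup.find("a", rel="next", href=True)
    url = urljoin(url, next_page["href"]) if next_page else None

print(len(links))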

Upvotes: 0

Prayson W. Daniel

Reputation: 15588

You do not need BeautifulSoup to pick through the HTML DOM at all, as the website provides the JSON responses that populate the HTML; requests alone can do the work. If you monitor "Network" in the Chrome or Firefox developer tools, you will see that on each load the browser sends a GET request to an API. Using that, we can get clean JSON data out.

Disclaimer: I have not checked whether this site allows web scraping. Do double-check their terms of use; I am assuming that you did.

I used pandas to help with the tabular data and to export it to CSV or whatever format you prefer: pip install pandas

import pandas as pd
from requests import Session

# Using Session and a header
req = Session() 
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
                         'AppleWebKit/537.36 (KHTML, like Gecko) '\
                         'Chrome/75.0.3770.80 Safari/537.36',
          'Content-Type': 'application/json;charset=UTF-8'}
# Add headers
req.headers.update(headers)

BASE_URL = 'https://www.carwale.com/webapi/classified/stockfilters/'

# Monitoring the updates on Network, the params changes in each load
#sc=-1&so=-1&car=7&pn=1
#sc=-1&so=-1&car=7&pn=2&lcr=24&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=3&lcr=48&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=4&lcr=72&ldr=0&lir=0

params = dict(sc=-1, so=-1, car=7, pn=4, lcr=72, ldr=0, lir=0)

r = req.get(BASE_URL, params=params) #just like requests.get

# Check if everything is okay
assert r.ok, 'We did not get 200'

# get json data
data = r.json()

# Put it in DataFrame
df = pd.DataFrame(data['ResultData'])

print(df.head())

# to go to another page create a function:

def scrap_carwale(params):
    r = req.get(BASE_URL, params=params)
    if not r.ok:
        raise ConnectionError('We did not get 200')
    data = r.json()

    return  pd.DataFrame(data['ResultData'])


# Just first 5 pages :)    
for i in range(5):
    # Per the pattern above, pn advances by 1 and lcr by 24 on each load
    params['pn'] += 1
    params['lcr'] += 24

    dt = scrap_carwale(params)
    # append the new page's data
    df = pd.concat([df, dt], ignore_index=True)

#print data sample
print(df.sample(10))

# Save data to csv or whatever format
df.to_csv('my_data.csv') #see df.to_?
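
To tie this back to the original question (a complete list of listing links), the URLs should be present in the JSON records themselves. A minimal sketch, assuming each row of data['ResultData'] carries a relative link; the column name 'url' here is a guess, so inspect df.columns for the real key:

# 'url' is an assumed column name; print(df.columns) to find the actual field
if 'url' in df.columns:
    links = ('https://www.carwale.com' + df['url'].astype(str)).tolist()
    print(links[:5])
else:
    print(df.columns.tolist())  # locate the field that holds the listing link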

The request in the Network tab, the JSON response, and a sample of the results are shown in screenshots in the original answer.

Upvotes: 1
