Anant Gupta

Reputation: 53

Scraping content from infinite scroll website

I am trying to scrape the links from a webpage with infinite scrolling. I am only able to fetch the links on the first pane. How do I proceed so as to build a complete list of all the links? Here is what I have so far:


from bs4 import BeautifulSoup
import requests

html = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&car=7&pn=8&lcr=168&ldr=0&lir=0"
html_content = requests.get(html).text
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all("div", {"class": "card-detail-block__data"})

y = []
for i in table:
    try:
        y.append(i.find("a", {"id":"linkToDetails"}).get('href'))
    except AttributeError:
        pass

z = [('carwale.com' + item) for item in y]
z

Upvotes: 3

Views: 545

Answers (2)

Ikram Khan Niazi

Reputation: 805

Try this

next_page = soup.find('a', rel='next', href=True)

if next_page:
    # next_page['href'] may be relative; prefix the site root if needed
    next_html_content = requests.get(next_page['href']).text

The next-page URL is hidden in the site source. You can find it by searching for the rel="next" attribute in the page source in your browser.
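
For completeness, here is a minimal sketch of how that lookup could be repeated in a loop to collect the detail links from every page. It reuses the selectors from the question; whether this listing page actually renders a rel="next" anchor is an assumption, so verify it in the page source first.

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

start_url = "https://www.carwale.com/used/cars-for-sale/"  # assumption: first listing page
links = []
url = start_url

while url:
    soup = BeautifulSoup(requests.get(url).text, "lxml")

    # Collect the detail links on the current page (same selectors as in the question)
    for card in soup.find_all("div", {"class": "card-detail-block__data"}):
        a = card.find("a", {"id": "linkToDetails"})
        if a and a.get("href"):
            links.append(urljoin(url, a["href"]))

    # Follow the rel="next" link if the page exposes one; otherwise stop
    next_page = soup.find("a", rel="next", href=True)
    url = urljoin(url, next_page["href"]) if next_page else None

print(len(links))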

Upvotes: 0

Prayson W. Daniel

Reputation: 15588

You do not need BeautifulSoup to pick through the HTML DOM at all, as the website provides the JSON responses that populate the HTML; requests alone can do the work. If you monitor "Network" in the Chrome or Firefox developer tools, you will see that on each load the browser sends a GET request to an API. Using that, we can get clean JSON data out.

Disclaimer: I have not checked whether this site allows web scraping. Do double-check their terms of use; I am assuming that you did.

I used pandas to help with the tabular data and to export it to CSV or whatever format you prefer: pip install pandas

import pandas as pd
from requests import Session

# Using Session and a header
req = Session() 
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
                         'AppleWebKit/537.36 (KHTML, like Gecko) '\
                         'Chrome/75.0.3770.80 Safari/537.36',
          'Content-Type': 'application/json;charset=UTF-8'}
# Add headers
req.headers.update(headers)

BASE_URL = 'https://www.carwale.com/webapi/classified/stockfilters/'

# Monitoring the updates on Network, the params changes in each load
#sc=-1&so=-1&car=7&pn=1
#sc=-1&so=-1&car=7&pn=2&lcr=24&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=3&lcr=48&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=4&lcr=72&ldr=0&lir=0

params = dict(sc=-1, so=-1, car=7, pn=4, lcr=72, ldr=0, lir=0)

r = req.get(BASE_URL, params=params) #just like requests.get

# Check if everything is okay
assert r.ok, 'We did not get 200'

# get json data
data = r.json()

# Put it in DataFrame
df = pd.DataFrame(data['ResultData'])

print(df.head())

# to go to another page create a function:

def scrap_carwale(params):
    r = req.get(BASE_URL, params=params)
    if not r.ok:
        raise ConnectionError('We did not get 200')
    data = r.json()

    return  pd.DataFrame(data['ResultData'])


# Just first 5 pages :)    
for i in range(5):
    # Per the pattern above, pn advances by 1 and lcr by 24 on each load
    params['pn'] += 1
    params['lcr'] += 24

    dt = scrap_carwale(params)
    # append the new page's data
    df = pd.concat([df, dt], ignore_index=True)

#print data sample
print(df.sample(10))

# Save data to csv or whatever format
df.to_csv('my_data.csv') #see df.to_?
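
To tie this back to the original question (a complete list of listing links), the URLs should be present in the JSON records themselves. A minimal sketch, assuming each row of data['ResultData'] carries a relative link; the column name 'url' here is a guess, so inspect df.columns for the real key:

# 'url' is an assumed column name; print(df.columns) to find the actual field
if 'url' in df.columns:
    links = ('https://www.carwale.com' + df['url'].astype(str)).tolist()
    print(links[:5])
else:
    print(df.columns.tolist())  # locate the field that holds the listing link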

The request in the Network tab, the JSON response, and a sample of the results are shown in screenshots in the original answer.

Upvotes: 1
