Reputation: 53
I am trying to scrape the links on a webpage with infinite scrolling. I am only able to fetch the links on the first pane. How do I proceed so as to build a complete list of all the links? Here is what I have so far -
from bs4 import BeautifulSoup
import requests

html = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&car=7&pn=8&lcr=168&ldr=0&lir=0"
html_content = requests.get(html).text
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all("div", {"class": "card-detail-block__data"})
y = []
for i in table:
    try:
        y.append(i.find("a", {"id": "linkToDetails"}).get('href'))
    except AttributeError:
        pass
z = [('carwale.com' + item) for item in y]
z
Upvotes: 3
Views: 545
Reputation: 805
Try this
next_page = soup.find('a', rel='next', href=True)
if next_page:
    # if the href is relative, prepend the domain before requesting it
    next_html_content = requests.get(next_page['href']).text
The next page URL is hidden in the page source. You can find it by searching for the rel="next" tag in the browser's view-source or developer tools.
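If it helps, here is a minimal sketch of the full loop, combining the question's selectors with that rel="next" link and using urljoin to resolve relative hrefs; the rel="next" anchor is taken from the suggestion above and not verified against the live site:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://www.carwale.com/used/cars-for-sale/"
links = []
while url:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    # collect the detail-page links on the current page
    for block in soup.find_all("div", {"class": "card-detail-block__data"}):
        a = block.find("a", {"id": "linkToDetails"})
        if a and a.get("href"):
            links.append(urljoin(url, a["href"]))
    # follow the rel="next" link, if any
    next_page = soup.find("a", rel="next", href=True)
    url = urljoin(url, next_page["href"]) if next_page else None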
Upvotes: 0
Reputation: 15588
You do not need BeautifulSoup to wrangle the HTML DOM at all, because the website populates the page from JSON responses; Requests alone can do the work. If you monitor the "Network" tab in Chrome's or Firefox's developer tools, you will see that each scroll load sends a GET request to an API, and from that we can get clean JSON data.
Disclaimer: I have not checked whether this site allows web scraping. Do double-check their terms of use; I am assuming that you did.
I used pandas to help deal with the tabular data and to export it to CSV or whatever format you prefer: pip install pandas
import pandas as pd
from requests import Session
# Using Session and a header
req = Session()
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.80 Safari/537.36',
    'Content-Type': 'application/json;charset=UTF-8'
}
# Add headers
req.headers.update(headers)
BASE_URL = 'https://www.carwale.com/webapi/classified/stockfilters/'
# Monitoring the updates on Network, the params changes in each load
#sc=-1&so=-1&car=7&pn=1
#sc=-1&so=-1&car=7&pn=2&lcr=24&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=3&lcr=48&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=4&lcr=72&ldr=0&lir=0
params = dict(sc=-1, so=-1, car=7, pn=4, lcr=72, ldr=0, lir=0)
r = req.get(BASE_URL, params=params) #just like requests.get
# Check if everything is okay
assert r.ok, 'We did not get 200'
# get json data
data = r.json()
# Put it in DataFrame
df = pd.DataFrame(data['ResultData'])
print(df.head())
# to go to another page create a function:
def scrap_carwale(params):
    r = req.get(BASE_URL, params=params)
    if not r.ok:
        raise ConnectionError('We did not get 200')
    data = r.json()
    return pd.DataFrame(data['ResultData'])
# Just the first 5 pages :)
for i in range(5):
    params['pn'] += 1
    params['lcr'] += 24  # lcr grows by 24 per page, matching the requests observed above
    dt = scrap_carwale(params)
    # append the new page's data
    df = pd.concat([df, dt], ignore_index=True)

# print a data sample
print(df.sample(10))
# Save data to csv or whatever format
df.to_csv('my_data.csv') #see df.to_?
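If you want the complete list rather than just five pages, you can keep requesting until the API runs out of results. A rough sketch, assuming (I have not verified this) that an empty ResultData marks the last page and that lcr keeps advancing by 24 per page as in the requests above:
# Walk every page until the API returns no rows.
# Assumptions: an empty 'ResultData' means we are past the last page,
# and 'lcr' advances by 24 per page (as in the observed requests).
all_pages = []
params = dict(sc=-1, so=-1, car=7, pn=1, lcr=0, ldr=0, lir=0)
while True:
    page = scrap_carwale(params)
    if page.empty:
        break
    all_pages.append(page)
    params['pn'] += 1
    params['lcr'] += 24
full_df = pd.concat(all_pages, ignore_index=True)
full_df.to_csv('all_listings.csv')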
Upvotes: 1