Mayank Jhunjhunwala

Reputation: 83

Scraping with BeautifulSoup only gets me 33 results from an infinite-scrolling page. How do I increase the number of results?

The website link:

https://collegedunia.com/management/human-resources-management-colleges

The code:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://collegedunia.com/management/human-resources-management-colleges")
c = r.content

soup = BeautifulSoup(c,"html.parser")


# Each college card on the page; named "cards" to avoid shadowing the built-in all()
cards = soup.find_all("div",{"class":"jsx-765939686 col-4 mb-4 automate_client_img_snippet"})

l = []
for divParent in cards:
    item = divParent.find("div",{"class":"jsx-765939686 listing-block text-uppercase bg-white position-relative"})
    d = {}

    d["Name"] = item.find("div",{"class":"jsx-765939686 top-block position-relative overflow-hidden"}).find("div",{"class":"jsx-765939686 clg-name-address"}).find("h3").text

    d["Rating"] = item.find("div",{"class":"jsx-765939686 bottom-block w-100 position-relative"}).find("ul").find_all("li")[-1].find("a").find("span").text
    
    d["Location"] = item.find("div",{"class":"jsx-765939686 clg-head d-flex"}).find("span").find("span",{"class":"mr-1"}).text
    
    l.append(d)

import pandas
df = pandas.DataFrame(l)
df.to_excel("Output.xlsx")
    

The page keeps adding colleges as you scroll down. I don't know whether I can get all the data, but is there a way to at least increase the number of results I get? There are 2506 entries in total, as can be seen on the website.

Upvotes: 1

Views: 719

Answers (2)

pb36

Reputation: 410

Looking at the network requests for this page, you can see that the data is fetched through an AJAX request, and the query parameters are sent as a base64-encoded string. You can use the code below to fetch the data and parse it into whatever format you need.

Code:

import json
import pandas
import requests
import base64

collegedata = []
count = 0
while True:
    # The API expects the query as base64-encoded JSON in the "data" parameter.
    datadict = {"url": "management/human-resources-management-colleges", "stream": "13", "sub_stream_id": "607",
                "page": count}
    data = base64.urlsafe_b64encode(json.dumps(datadict).encode()).decode()
    params = {
        "data": data
    }
    response = requests.get('https://collegedunia.com/web-api/listing', params=params).json()
    # Collect this page's colleges before checking hasNext,
    # so the entries on the final page are not dropped.
    for i in response.get("colleges", []):
        d = {}
        d["Name"] = i["college_name"]
        d["Rating"] = i["rating"]
        d["Location"] = i["college_city"] + ", " + i["state"]
        collegedata.append(d)
        print(d)
    if not response["hasNext"]:
        break
    count += 1

df = pandas.DataFrame(collegedata)
df.to_excel("Output.xlsx", index=False)


Let me know if you have any questions :)

Upvotes: 2

bigbounty

Reputation: 17408

When you analyse the website via the Network tab in Chrome, you can see that the site makes XHR calls in the background.

The endpoint it sends an HTTP GET request to is:

https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30=

When you send a GET request with the requests module, you get a JSON response back.

import requests

url = "https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="

res = requests.get(url)

print(res.json())

But you need all the data, not just a single page. The data sent in the request is base64 encoded, i.e. if you decode the data parameter of the GET request, you get the following:

{"url":"management/human-resources-management-colleges","stream":"13","sub_stream_id":"607","page":3}
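To check this yourself, here is a minimal sketch that decodes the data parameter from the URL above using only the standard library. The helper name decode_listing_param is made up for this example.

```python
import base64
import json

# The "data" query parameter copied from the request URL above.
encoded = (
    "eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2Vz"
    "Iiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="
)


def decode_listing_param(value):
    """Decode the base64 'data' parameter into a Python dict (hypothetical helper)."""
    raw = base64.urlsafe_b64decode(value)
    return json.loads(raw.decode("utf-8"))


payload = decode_listing_param(encoded)
print(payload)
```

The decoded dict has the keys url, stream, sub_stream_id and page, which are the values you can vary.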

Now change the page number, sub_stream_id, stream, etc. as needed to get the complete data from the website.
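Going the other way, a rough sketch of how you might build the data parameter for any page. The function name encode_listing_param is hypothetical, and the compact JSON separators are my assumption, chosen so the output matches the string the site itself sends; I have not checked whether the server also accepts other spacing or which page number it starts from.

```python
import base64
import json


def encode_listing_param(page, stream="13", sub_stream_id="607",
                         url="management/human-resources-management-colleges"):
    """Build the base64 'data' parameter for one page (hypothetical helper)."""
    payload = {
        "url": url,
        "stream": stream,
        "sub_stream_id": sub_stream_id,
        "page": page,
    }
    # Compact separators (no spaces) reproduce the JSON seen in the browser request.
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


# Print the parameter for the first few page numbers.
for page in range(1, 4):
    print(page, encode_listing_param(page))
```

You can then pass the result as params={"data": encode_listing_param(page)} to requests.get on the web-api/listing endpoint and loop over the pages.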

Upvotes: 2
