Nakkhatra

Reputation: 65

"Load more" pagination in an HTML webpage - web scraping

I want to scrape data from https://en.prothomalo.com/search?q=road%20accident but the page has no pagination that changes the URL with each click. It has only a "Load More" button, and clicking it changes nothing in the URL or the page script. How can I scrape all the results automatically, without clicking the button manually, using BeautifulSoup in Python? I have seen a similar question on Stack Overflow, but that one dealt with JSON, whereas my page appears to be plain HTML.

Inspecting the load more button shows this line of code:

<span class="load-more-content more-m_content_1XWY0 more-m_en-content_2lUOO">Load More</span>

Upvotes: 1

Views: 336

Answers (2)

import requests


def main(url):
    params = {
        "fields": "headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards",
        "offset": "0",
        "limit": "10",  # increase here up to what you need.
        "q": "road accident"
    }
    r = requests.get(url, params=params).json()
    for num, x in enumerate(r['items'], start=1):
        print("[{}] ---> {}".format(num, x['headline'].strip()))


main('https://en.prothomalo.com/api/v1/advanced-search')

Output:

[1] ---> Four people killed in Jashore road accident
[2] ---> Three killed in  Mymensingh road accident
[3] ---> 2 killed in Fatullah road accident
[4] ---> 7 killed in three road accidents in Chattogram, Rangamati
[5] ---> ASI killed in Chattogram road accident
[6] ---> Mother-son killed in Sylhet road accident
[7] ---> RAB member, another killed in Gazipur road accident
[8] ---> 3 killed in Chattogram road accident
[9] ---> Road accident kills two workers in Noakhali
[10] ---> Couple killed in Meherpur road accident
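
Note that `strip()` only trims the ends of a headline: item [2] above still contains a double space. If that matters, a small helper along these lines could collapse internal whitespace as well. This is a minimal sketch; `normalize_headline` is a hypothetical name, not part of the API or of the code above.

```python
def normalize_headline(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    # str.split() with no arguments splits on any whitespace and drops
    # empty pieces, so joining with a single space also trims both ends.
    return " ".join(text.split())
```

It would replace the `x['headline'].strip()` call, e.g. `normalize_headline(x['headline'])`.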

Upvotes: 1

Andrej Kesely

Reputation: 195438

The next pages are loaded by JavaScript from an external URL that returns JSON. You can use the requests library to simulate those calls. For example:

import json
import requests


url = "https://en.prothomalo.com/api/v1/advanced-search"

params = {
    "fields": "headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards",
    "offset": "0",
    "limit": "6",
    "q": "road accident",
}

for offset in range(0, 100, 6):  # step matches "limit"; raise 100 to fetch more results
    params["offset"] = offset
    data = requests.get(url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for i in data["items"]:
        print(i["headline"])

Prints:

Four people killed in Jashore road accident
Three killed in  Mymensingh road accident
2 killed in Fatullah road accident
7 killed in three road accidents in Chattogram, Rangamati


ASI killed in Chattogram road accident
Mother-son killed in Sylhet road accident


RAB member, another killed in Gazipur road accident 
3 killed in Chattogram road accident 
Road accident kills two workers in Noakhali
Couple killed in Meherpur road accident
4 killed, 7 injured in Rangpur road accident
3 Bangladeshis killed in Oman road accident


Implement transport act to halt road accident deaths
One killed, 3 injured in Panchagarh road accident 
Two musicians killed in Chattogram road accident
One killed in Panchagarh road accident


Road accident kills one in Narail


5 Bangladeshi workers killed in Oman road accident
Two killed in Sylhet road accident

... and so on.
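
The fixed `range(0, 100, 6)` stops at an arbitrary offset. If you want every result, one option is to keep requesting pages until the API returns an empty `items` list. The sketch below assumes the endpoint behaves that way once the results run out, and that it accepts a shortened `fields` list; neither is confirmed here. `fetch_page` and `paginate` are hypothetical helper names.

```python
from typing import Callable, Dict, Iterator, List, Optional

API_URL = "https://en.prothomalo.com/api/v1/advanced-search"


def fetch_page(offset: int, limit: int, query: str = "road accident") -> List[Dict]:
    """Fetch one page of search results from the endpoint used above."""
    import requests  # imported here so paginate() can be exercised offline

    params = {
        "fields": "headline,url",  # assumption: a shorter field list is accepted
        "offset": offset,
        "limit": limit,
        "q": query,
    }
    response = requests.get(API_URL, params=params, timeout=30)
    response.raise_for_status()
    # Assumption: results live under "items", as in the code above.
    return response.json().get("items", [])


def paginate(
    fetch: Callable[[int, int], List[Dict]],
    limit: int = 10,
    max_items: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield items page by page until a page comes back empty.

    max_items is a safety cap in case the server never returns an empty page.
    """
    offset = 0
    yielded = 0
    while True:
        items = fetch(offset, limit)
        if not items:
            return  # no more results
        for item in items:
            yield item
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return
        offset += limit  # move to the next page
```

Usage would look like `for item in paginate(fetch_page, limit=10, max_items=500): print(item["headline"].strip())`. Passing the fetch function in as an argument keeps the loop independent of the network, so it can be checked with a fake data source first.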

Upvotes: 1
