Reputation: 65
I want to scrape data from https://en.prothomalo.com/search?q=road%20accident but the page has no pagination that changes the URL with each click. It only has a "Load More" button, and clicking it does not change the URL or the page's script. How can I scrape all the results automatically, without clicking the button manually, using BeautifulSoup in Python? I have seen a similar question on Stack Overflow, but that one involved JSON, and my page appears to be plain HTML.
Inspecting the load more button shows this line of code:
<span class="load-more-content more-m_content_1XWY0 more-m_en-content_2lUOO">Load More</span>
Upvotes: 1
Views: 336
Reputation: 11515
import requests


def main(url):
    params = {
        "fields": "headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards",
        "offset": "0",
        "limit": "10",  # increase this to fetch as many results as you need
        "q": "road accident"
    }
    # The search results come from this JSON API, so no HTML parsing is needed.
    r = requests.get(url, params=params).json()
    for num, x in enumerate(r['items'], start=1):
        print("[{}] ---> {}".format(num, x['headline'].strip()))


main('https://en.prothomalo.com/api/v1/advanced-search')
Output:
[1] ---> Four people killed in Jashore road accident
[2] ---> Three killed in Mymensingh road accident
[3] ---> 2 killed in Fatullah road accident
[4] ---> 7 killed in three road accidents in Chattogram, Rangamati
[5] ---> ASI killed in Chattogram road accident
[6] ---> Mother-son killed in Sylhet road accident
[7] ---> RAB member, another killed in Gazipur road accident
[8] ---> 3 killed in Chattogram road accident
[9] ---> Road accident kills two workers in Noakhali
[10] ---> Couple killed in Meherpur road accident
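To keep more than the headline, each item can be flattened into a row. This is a minimal sketch, assuming the items carry the keys requested in `fields`: only `headline` is confirmed by the output above, `url` and `published-at` are assumed, and `items_to_csv` is a hypothetical helper name.

```python
import csv
import io


def items_to_csv(items, columns=("headline", "url", "published-at")):
    """Return CSV text with one row per item; missing keys become empty cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for item in items:
        row = []
        for col in columns:
            # .get() avoids a KeyError if the API omits a field for some story.
            value = item.get(col)
            row.append("" if value is None else str(value).strip())
        writer.writerow(row)
    return buffer.getvalue()
```

Calling `items_to_csv(r['items'])` inside `main` would then give text that can be written straight to a `.csv` file.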
Upvotes: 1
Reputation: 195438
The next pages are loaded by JavaScript from a separate API URL that returns JSON, so the "Load More" button never changes the page URL. You can use the requests library to make the same calls yourself. For example:
import json
import requests

url = "https://en.prothomalo.com/api/v1/advanced-search"
params = {
    "fields": "headline,subheadline,slug,url,hero-image-s3-key,hero-image-caption,hero-image-metadata,first-published-at,last-published-at,alternative,published-at,authors,author-name,author-id,sections,story-template,metadata,tags,cards",
    "offset": "0",
    "limit": "6",
    "q": "road accident",
}

# Step the offset by the page size (limit); raise the upper bound to fetch more.
for offset in range(0, 100, 6):
    params["offset"] = offset
    data = requests.get(url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for i in data["items"]:
        print(i["headline"])
Prints:
Four people killed in Jashore road accident
Three killed in Mymensingh road accident
2 killed in Fatullah road accident
7 killed in three road accidents in Chattogram, Rangamati
ASI killed in Chattogram road accident
Mother-son killed in Sylhet road accident
RAB member, another killed in Gazipur road accident
3 killed in Chattogram road accident
Road accident kills two workers in Noakhali
Couple killed in Meherpur road accident
4 killed, 7 injured in Rangpur road accident
3 Bangladeshis killed in Oman road accident
Implement transport act to halt road accident deaths
One killed, 3 injured in Panchagarh road accident
Two musicians killed in Chattogram road accident
One killed in Panchagarh road accident
Road accident kills one in Narail
5 Bangladeshi workers killed in Oman road accident
Two killed in Sylhet road accident
... and so on.
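Rather than hard-coding the upper bound of `range`, the loop can stop when the API returns no more items. This is a minimal sketch, assuming an empty `items` list marks the end of the results (not verified against the API). `paginate` and `fetch_search_page` are hypothetical helper names, and the shortened `fields` list assumes the API accepts a subset of the fields used above.

```python
def paginate(fetch_page, limit=10, max_items=1000):
    """Yield items from successive pages until a page comes back empty.

    fetch_page(offset, limit) must return a list of items.
    max_items is an approximate safety cap: no new page is requested
    once the offset reaches it, so a misbehaving API cannot loop forever.
    """
    offset = 0
    while offset < max_items:
        items = fetch_page(offset, limit)
        if not items:
            break
        yield from items
        # Advance by what was actually returned, in case a page is short.
        offset += len(items)


def fetch_search_page(offset, limit, query="road accident"):
    """Fetch one page of search results from the endpoint used above."""
    # Imported lazily so paginate() can be used without requests installed.
    import requests

    params = {
        "fields": "headline,url,published-at",  # assumed subset of the fields
        "offset": offset,
        "limit": limit,
        "q": query,
    }
    response = requests.get(
        "https://en.prothomalo.com/api/v1/advanced-search",
        params=params,
        timeout=30,
    )
    response.raise_for_status()
    # Assumption: a response without "items" means there are no more results.
    return response.json().get("items", [])
```

For example, `for item in paginate(fetch_search_page): print(item["headline"])` would print every headline until the results run out or the cap is reached.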
Upvotes: 1