Reputation: 1170
I want to scrape all links from a website that does not have pagination, i.e. there's a 'LOAD MORE' button, but the URL does not change depending on how much data you've asked for.
When I parse the page with BeautifulSoup
and ask for all the links, it only returns the links on the vanilla first page of the website. I can manually reveal older content by clicking the 'LOAD MORE' button, but is there a way to do so programmatically?
This is what I mean:
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.thedailybeast.com/politics.html')
soup = BeautifulSoup(page)
for link in soup.find_all('a'):
    print link.get('href')
And unfortunately there's no URL that is responsible for pagination.
Upvotes: 1
Views: 2508
Reputation: 474061
When you click the "Load More" button, an XHR request is issued to the http://www.thedailybeast.com/politics.view.<page_number>.json
endpoint. You need to simulate that request in your code and parse the JSON response. Working example using requests
:
import requests

with requests.Session() as session:
    for page in range(1, 10):
        print("Page number #%s" % page)
        response = session.get("http://www.thedailybeast.com/politics.view.%s.json" % page)
        data = response.json()
        for article in data["stream"]:
            print(article["title"])
Prints:
Page number #1
The Two Americas Behind Donald Trump and Bernie Sanders
...
Hillary Clinton’s Star-Studded NYC Bash: Katy Perry, Jamie Foxx, and More Toast the Candidate
Why Do These Republicans Hate Maya Angelou’s Post Office?
Page number #2
No, Joe Biden Is Not a Supreme Court Hypocrite
PC Hysteria Claims Another Professor
WHY BLACK CELEB ENDORSEMENTS MATTER MOST
...
Inside Trump’s Make Believe Presidential Addresses
...
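The loop above hard-codes 9 pages. If you want *all* content, a sketch like the following (not tested against the live site) keeps requesting pages until the request fails or the `stream` list comes back empty; the stopping conditions are assumptions about how the endpoint behaves past the last page. The `stream` and `title` keys are the ones seen in the responses above:

```python
import requests


def extract_titles(data):
    # "stream" and "title" are the keys observed in the JSON responses above;
    # a page past the end is assumed to return an empty or missing "stream".
    return [article["title"] for article in data.get("stream", [])]


def scrape_all_titles(url_template="http://www.thedailybeast.com/politics.view.%s.json"):
    titles = []
    with requests.Session() as session:
        page = 1
        while True:
            response = session.get(url_template % page)
            if response.status_code != 200:
                break  # assumed: server signals no such page
            page_titles = extract_titles(response.json())
            if not page_titles:
                break  # assumed: empty stream means no more content
            titles.extend(page_titles)
            page += 1
    return titles


if __name__ == "__main__":
    for title in scrape_all_titles():
        print(title)
```

Keeping `extract_titles` as a pure function over the parsed payload makes it easy to test without hitting the network.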
Upvotes: 3