Zlo

Reputation: 1170

Get all links with BeautifulSoup from a single page website ('Load More' feature)

I want to scrape all links from a website that does not have pagination, i.e., there's a 'LOAD MORE' button, but the URL does not change no matter how much data you've asked for.

When I BeautifulSoup the page and ask for all the links, it simply returns the links on the vanilla first page of the website. I can manually view older content by clicking the 'LOAD MORE' button, but is there a way to do so programmatically?

This is what I mean:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.thedailybeast.com/politics.html')
soup = BeautifulSoup(page)

for link in soup.find_all('a'):
    print link.get('href')

And unfortunately there's no URL that is responsible for pagination.
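To illustrate the limitation: BeautifulSoup only ever sees the links in the HTML the server initially returned; anything the 'LOAD MORE' button fetches later via JavaScript is invisible to it. A minimal sketch on a made-up HTML snippet (the markup here is invented for illustration, not taken from the real site):

```python
from bs4 import BeautifulSoup

# A canned page: two links are present in the initial HTML, and a
# "LOAD MORE" button that would fetch more content via JavaScript.
html = """
<html><body>
  <a href="/politics/article-1.html">Article 1</a>
  <a href="/politics/article-2.html">Article 2</a>
  <button id="load-more">LOAD MORE</button>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Only the two links baked into the HTML are found; whatever the
# button would load later is simply not there to be parsed.
links = [a.get("href") for a in soup.find_all("a")]
print(links)
```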

Upvotes: 1

Views: 2508

Answers (1)

alecxe

Reputation: 474061

When you click the "Load More" button, an XHR request is issued to the http://www.thedailybeast.com/politics.view.<page_number>.json endpoint, where <page_number> is the page index. You need to simulate that request in your code and parse the JSON response. Working example using requests:

import requests

with requests.Session() as session:
    for page in range(1, 10):
        print("Page number #%s" % page)
        response = session.get("http://www.thedailybeast.com/politics.view.%s.json" % page)
        data = response.json()

        for article in data["stream"]:
            print(article["title"])

Prints:

Page number #1
The Two Americas Behind Donald Trump and Bernie Sanders
...
Hillary Clinton’s Star-Studded NYC Bash: Katy Perry, Jamie Foxx, and More Toast the Candidate
Why Do These Republicans Hate Maya Angelou’s Post Office?
Page number #2
No, Joe Biden Is Not a Supreme Court Hypocrite
PC Hysteria Claims Another Professor
WHY BLACK CELEB ENDORSEMENTS MATTER MOST
...
Inside Trump’s Make Believe Presidential Addresses
...
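Since the original question asked for links rather than titles, the same JSON-parsing step can be pointed at a link field instead. A self-contained sketch on a canned payload so it runs offline; the "stream" and "title" keys follow the answer above, but the "url" key is hypothetical, so inspect the real JSON response to find the actual field name:

```python
import json

# Canned stand-in for one page of the JSON endpoint's response.
# The "url" values here are invented for illustration.
payload = """
{"stream": [
  {"title": "Example Article A", "url": "/politics/example-a.html"},
  {"title": "Example Article B", "url": "/politics/example-b.html"}
]}
"""

data = json.loads(payload)

# Collect the link for each article in the stream.
links = [article["url"] for article in data["stream"]]
print(links)
```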

Upvotes: 3
