jim jarnac

Reputation: 5152

Scrape page with "load more results" button

I am trying to scrape the following page with requests and BeautifulSoup/Lxml

https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all

This is the kind of page that has a "load more results" button. I have found a few pages explaining how to do this, but none within the framework of requests.

I understand that I should spend a few more hours researching this problem before resorting to asking here, so as to show proof that I've tried.

I've tried looking into the inspect pane, the network tab, etc., but I'm still a bit too fresh with requests to understand how to interact with JavaScript.

I don't need a full-blown script/solution as an answer, just some pointers on how to do this very typical task with requests, to save me a few precious hours of research.

Thanks in advance.

Upvotes: 5

Views: 6824

Answers (1)

briancaffey

Reputation: 2559

Here's a quick script that should show how this can be done with Selenium:

from selenium import webdriver
import time

url = "https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all"
driver = webdriver.PhantomJS()  # any headless browser works; newer Selenium versions drop PhantomJS in favor of headless Chrome/Firefox
driver.get(url)
page_num = 0

# Keep clicking "load more results" until the button disappears
while driver.find_elements_by_css_selector('.search-result-more-txt'):
    driver.find_element_by_css_selector('.search-result-more-txt').click()
    page_num += 1
    print("getting page number " + str(page_num))
    time.sleep(1)  # give the new results time to load

html = driver.page_source.encode('utf-8')

I don't know how to do this with requests. There seem to be lots of articles about soybeans on Reuters; I've already done over 250 "page loads" as I finish writing this answer.
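That said, the usual requests-based approach is to open the browser's Network tab, click "load more results" once, and copy the XHR request the button fires; that request almost always carries a page or offset parameter you can increment yourself. Here is a minimal sketch of that idea; the endpoint path and parameter names (`pn`, `numResultsToShow`) are hypothetical placeholders, to be replaced with whatever the Network tab actually shows:

```python
# Sketch: replicate the "load more" XHR with plain requests.
# BASE and the parameter names below are ASSUMPTIONS -- substitute the
# real endpoint you observe in the browser's Network tab.
from urllib.parse import urlencode

BASE = "https://www.reuters.com/assets/searchArticleLoadMoreJson"  # hypothetical

def page_url(blob, page_num, num_results=10):
    """Build the URL for one "page" of search results."""
    params = {
        "blob": blob,              # the search term
        "sortBy": "date",
        "numResultsToShow": num_results,
        "pn": page_num,            # hypothetical page-number parameter
    }
    return BASE + "?" + urlencode(params)

# Then fetch pages until the endpoint stops returning results:
# import requests
# for pn in range(1, 100):
#     resp = requests.get(page_url("soybean", pn))
#     # parse resp (often JSON or JSONP) and break when it comes back empty
```

The advantage over Selenium is that each "page" is one cheap HTTP request instead of a browser click plus a sleep.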

Once you have loaded all, or some large number of, pages, you can extract the article links by passing html into Beautiful Soup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Each search result lives in a div with class "search-result-indiv"
links = soup.find_all('div', attrs={"class": 'search-result-indiv'})
# Pull the href from each result that actually contains a link
articles = [div.find('a')['href'] for div in links if div.find('a')]

Upvotes: 8
