Reputation: 5152
I am trying to scrape the following page with requests and BeautifulSoup/lxml:
https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all
This is one of those pages that has a "load more results" button.
I have found a few pages explaining how to do this, but none within the framework of requests.
I understand that I should spend a few more hours researching this problem before resorting to asking here, so as to show proof that I've tried.
I've tried looking into the inspect pane, the network tab, etc., but I'm still a bit too new to requests to understand how to interact with JavaScript.
I don't need a full-blown script/solution as an answer, just some pointers on how to do this very typical task with requests, to save me a few precious hours of research.
Thanks in advance.
Upvotes: 5
Views: 6824
Reputation: 2559
Here's a quick script that should show how this can be done with Selenium:
from selenium import webdriver
import time

url = "https://www.reuters.com/search/news?blob=soybean&sortBy=date&dateRange=all"

# PhantomJS is headless; newer Selenium releases have dropped it, in which
# case a headless Chrome or Firefox driver works the same way here.
driver = webdriver.PhantomJS()
driver.get(url)
html = driver.page_source.encode('utf-8')
page_num = 0

# Keep clicking the "load more results" button for as long as it exists.
while driver.find_elements_by_css_selector('.search-result-more-txt'):
    driver.find_element_by_css_selector('.search-result-more-txt').click()
    page_num += 1
    print("getting page number " + str(page_num))
    time.sleep(1)  # give the new batch of results a moment to load

# Grab the fully expanded page once no button is left.
html = driver.page_source.encode('utf-8')
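If the fixed time.sleep(1) ever proves flaky, Selenium's explicit waits are a more robust substitute. Here is a sketch of the same loop using WebDriverWait; the selector is unchanged from above, and the 10-second timeout is my own choice:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)  # allow up to 10 s per "load more" click
while driver.find_elements(By.CSS_SELECTOR, '.search-result-more-txt'):
    button = wait.until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, '.search-result-more-txt')))
    button.click()

html = driver.page_source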
I don't know how to do this with requests. There seem to be a lot of articles about soybeans on Reuters; I've already done over 250 "page loads" as I finish writing this answer.
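That said, the general pattern with requests is usually to open the network tab, click the "load more results" button once, and copy the XHR it fires; you can then replay that request directly with an incrementing page parameter. A minimal sketch of the pattern, where the endpoint name and every parameter are assumptions to be replaced with whatever the network tab actually shows:

import requests

# The endpoint and all parameters below are hypothetical; copy the real
# request that the button fires from your browser's network tab.
url = "https://www.reuters.com/assets/searchArticleLoadMoreJson"  # assumption
params = {
    "blob": "soybean",
    "sortBy": "date",
    "dateRange": "all",
    "numResultsToShow": 10,  # assumed batch size
    "pn": 1,                 # page number; bumping it replays "load more"
}

for page in range(1, 6):  # fetch the first five batches as a test
    params["pn"] = page
    resp = requests.get(url, params=params)
    resp.raise_for_status()
    print(resp.text[:200])  # the payload is typically JSON or JSONP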
Once you scrape all the pages, or some large number of them, you can then extract the data by passing html into Beautiful Soup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
# Each search result sits in its own div; pull the article links out of them.
links = soup.find_all('div', attrs={"class": 'search-result-indiv'})
articles = [div.find('a')['href'] for div in links if div.find('a')]
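If you then want the article pages themselves, you can feed those hrefs back into requests; since they may be relative, join them against the site root first:

from urllib.parse import urljoin
import requests

for href in articles:
    full_url = urljoin("https://www.reuters.com", href)  # handles relative hrefs
    page = requests.get(full_url)
    # ... parse each article page with Beautiful Soup as above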
Upvotes: 8