Reputation: 3
I am trying to scrape the Reuters website for all the news headlines related to the Middle East. Link to the webpage: https://www.reuters.com/subjects/middle-east
This page automatically loads older headlines as I scroll down, but when I look at the page source, it only contains the last 20 headline links.
I tried to look for a next or previous hyperlink, which is usually present in cases like this, but unfortunately there is no such hyperlink on this page.
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.reuters.com/subjects/middle-east'
result = requests.get(url)
content = result.content
soup = BeautifulSoup(content, 'html.parser')
# Gets all the links on the page source
links = []
for hl in soup.find_all('a'):
    if re.search('article', hl['href']):
        links.append(hl['href'])
# The first link is the page itself and so we skip it
links = links[1:]
# The urls are repeated and so we only keep the unique instances
urls = []
for url in links:
    if url not in urls:
        urls.append(url)
# The number of urls is limited to 20 (THE PROBLEM!)
print(len(urls))
I have very limited experience with all of this, but my best guess is that JavaScript on the page loads the earlier results when you scroll down, and that this is what I need to reproduce using some Python module.
The code goes on to extract other details from each of these links, but that is irrelevant to the posted problem.
Upvotes: 0
Views: 184
Reputation: 8205
You could use Selenium and send Keys.PAGE_DOWN to first scroll down and then get the page source. You can then feed this to BeautifulSoup if you prefer.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
browser.get("https://www.reuters.com/subjects/middle-east")
time.sleep(1)
elem = browser.find_element_by_tag_name("body")
no_of_pagedowns = 25
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)
    no_of_pagedowns -= 1
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
# Gets all the links on the page source
links = []
for hl in soup.find_all('a'):
    if re.search('article', hl['href']):
        links.append(hl['href'])
# The first link is the page itself and so we skip it
links = links[1:]
# The urls are repeated and so we only keep the unique instances
urls = []
for url in links:
    if url not in urls:
        urls.append(url)
# After scrolling, more than 20 urls are collected
print(len(urls))
Output
40
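If you do not want to hard-code the number of page-downs, a small variation (a sketch, not part of the original answer) is to keep scrolling until the page height stops growing, which loads as many headlines as the page will serve:
import time
from selenium import webdriver

browser = webdriver.Chrome(executable_path='/path/to/chromedriver')
browser.get("https://www.reuters.com/subjects/middle-east")

last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom and give the page a moment to load more headlines
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # No new content appeared, so stop scrolling
        break
    last_height = new_height

source = browser.page_source
# The same BeautifulSoup parsing as above can then be applied to source.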
Upvotes: 0