Chholak

Reputation: 3

How to get more than 20 news headline links for a subsection (e.g. Middle East) of the Reuters website using Python?

I am trying to scrape the Reuters website for all the news headlines related to the Middle East. Link to the webpage: https://www.reuters.com/subjects/middle-east

The page automatically loads older headlines as I scroll down, but the page source only contains the 20 most recent headline links.

I looked for a "next" or "previous" hyperlink, which is usually present in cases like this, but unfortunately there is no such hyperlink on this page.
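To illustrate, this is the kind of check I ran against the static HTML (a rough sketch; the selectors are only guesses at what a pagination link might look like), and both lookups come back empty:

import re

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.reuters.com/subjects/middle-east')
soup = BeautifulSoup(page.content, 'html.parser')

# Look for common pagination markers; neither search finds anything here
print(soup.find('a', rel='next'))
print(soup.find_all('a', string=re.compile(r'next|previous', re.I)))

Here is my code so far: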

import re

import requests
from bs4 import BeautifulSoup

url = 'https://www.reuters.com/subjects/middle-east'

result = requests.get(url)
soup = BeautifulSoup(result.content, 'html.parser')

# Collect every link on the page whose URL contains 'article'
links = []
for hl in soup.find_all('a'):
    href = hl.get('href', '')  # .get() avoids a KeyError on <a> tags without href
    if re.search('article', href):
        links.append(href)

# The first link is the page itself, so skip it
links = links[1:]

# The URLs are repeated, so keep only the unique instances (preserving order)
urls = []
for link in links:
    if link not in urls:
        urls.append(link)

# The number of URLs is capped at 20 (THE PROBLEM!)
print(len(urls))

I have very limited experience with all of this, but my best guess is that JavaScript (or whatever language the page uses) is what loads the earlier headlines when you scroll down, and that is probably what I need to reproduce from Python with some module.

The code goes on to extract other details from each of these links, but that is irrelevant to the problem posted here.
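If that guess is right, one way to test it might be to render the page with JavaScript enabled, for example with the requests-html package (a minimal sketch, assuming the package and the headless Chromium it downloads on first use are available; I haven't verified this against the page):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.reuters.com/subjects/middle-east')

# render() executes the page's JavaScript in headless Chromium; scrolldown
# presses Page Down the given number of times, sleeping between presses
r.html.render(scrolldown=25, sleep=1)

article_links = [l for l in r.html.absolute_links if 'article' in l]
print(len(article_links))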

Upvotes: 0

Views: 184

Answers (1)

Bitto

Reputation: 8205

You could use Selenium and send Keys.PAGE_DOWN to the page to scroll down before grabbing the page source. You can then feed that source to BeautifulSoup if you prefer.

import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service

browser = webdriver.Chrome(service=Service('/path/to/chromedriver'))
browser.get("https://www.reuters.com/subjects/middle-east")
time.sleep(1)

# Send PAGE_DOWN to the body a fixed number of times so the page loads
# more headlines before we read its source
elem = browser.find_element(By.TAG_NAME, "body")
no_of_pagedowns = 25
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)
    no_of_pagedowns -= 1

source = browser.page_source
browser.quit()
soup = BeautifulSoup(source, 'html.parser')

# Collect every link on the page whose URL contains 'article'
links = []
for hl in soup.find_all('a'):
    href = hl.get('href', '')  # .get() avoids a KeyError on <a> tags without href
    if re.search('article', href):
        links.append(href)

# The first link is the page itself, so skip it
links = links[1:]

# The URLs are repeated, so keep only the unique instances (preserving order)
urls = []
for link in links:
    if link not in urls:
        urls.append(link)

# No longer capped at 20
print(len(urls))

Output

40
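If you want every headline rather than a fixed number of screens, a common variant of the same idea (a sketch, not tuned to this page; the one-second pause is a guess and may need adjusting) is to keep scrolling to the bottom until the document height stops growing:

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

browser = webdriver.Chrome(service=Service('/path/to/chromedriver'))
browser.get("https://www.reuters.com/subjects/middle-east")

last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    # Jump to the bottom and give the page time to fetch more headlines
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # height unchanged, so nothing new was loaded
    last_height = new_height

source = browser.page_source
browser.quit()

The resulting source can then be parsed exactly as above.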

Upvotes: 0
