Reputation: 458
I'm scraping the WSJ using BeautifulSoup, but it never seems to find the element with id="top-news", which is always present on the home page. I've tried find(), find_all() and a variety of other methods, but they all return None, so any method I call on my results object fails with a NoneType error.
I'm trying to extract metadata about the top news articles, primarily the article title and URL. Each article's metadata is in an element with the class "WSJTheme--headline--7VCzo7Ay ", but I only want the ones located under the "top-news" div.
Here is my code:
import requests
from bs4 import BeautifulSoup
from shutil import copyfile
URL = 'https://www.wsj.com'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='top-news')
topArticles = results.find_all('div', class_='WSJTheme--headline--7VCzo7Ay ')
Upvotes: 1
Views: 350
Reputation: 195438
Specify a User-Agent header to get the correct response from the server:
import requests
from bs4 import BeautifulSoup

url = "https://www.wsj.com/"
# Send a browser-like User-Agent so the server returns the regular home page
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

# Select every span under #top-news whose class contains "headline"
for headline in soup.select('#top-news span[class*="headline"]'):
    print(headline.text)
Prints:
Oil Giants Dealt Defeats as Climate Pressures Intensify
At Least Eight Killed in San Jose Shooting
HSBC to Exit Most U.S. Retail Banking
Amazon-MGM Deal Marks Win for Hedge Funds
Cities Reverse Defunding the Police Amid Rising Crime
Federal Prosecutors Have Asked Banks for Information About Archegos Meltdown
Why a Grand Plan to Vaccinate the World Against Covid Unraveled
Inside the Israel-Hamas Conflict and One of Its Deadliest Hours in Gaza
Eric Carle, ‘The Very Hungry Caterpillar’ Author, Dies at 91
Wynn May Face U.S. Action for Role in China’s Push to Expel Businessman
Walmart to Sell New Line of Gap-Branded Homegoods
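Since the question also asks for each article's URL, here is a minimal sketch of one way to collect title and link pairs while guarding against the missing-container case that caused the NoneType error. It assumes each headline element contains an `<a href>` link, which may not match the real WSJ markup; `extract_top_articles`, `BASE_URL` and `SAMPLE_HTML` are hypothetical names used only for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.wsj.com/"

# Hypothetical stand-in for the real page; the actual WSJ markup may differ.
SAMPLE_HTML = """
<div id="top-news">
  <h3 class="WSJTheme--headline--abc123"><a href="/articles/first-story">First story</a></h3>
  <h3 class="WSJTheme--headline--abc123"><a href="https://www.wsj.com/articles/second-story">Second story</a></h3>
</div>
<div id="other">
  <h3 class="WSJTheme--headline--abc123"><a href="/articles/ignored">Ignored</a></h3>
</div>
"""


def extract_top_articles(html, base_url=BASE_URL):
    """Return (title, url) pairs for headline links under the #top-news element."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="top-news")
    if container is None:
        # find() returns None when the element is absent, for example when
        # the server sent a different page than the one a browser receives.
        return []
    articles = []
    # Assumption: every headline element wraps an <a> tag carrying the link.
    for link in container.select('[class*="headline"] a[href]'):
        title = link.get_text(strip=True)
        # urljoin resolves relative hrefs and leaves absolute ones unchanged
        articles.append((title, urljoin(base_url, link["href"])))
    return articles


print(extract_top_articles(SAMPLE_HTML))
print(extract_top_articles("<html><body>blocked</body></html>"))
```

To run this against the live page, pass in the content of a response fetched with the User-Agent header. An empty list then suggests that either the request was served a different page or the selector no longer matches the markup.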
Upvotes: 1