Gabe

Reputation: 3

Web scraping with requests not working correctly

I am trying to get the HTML from CNN for a personal project. I am new to the requests library and have followed basic tutorials to fetch the page with it, but the responses I get are different from the HTML I see when I inspect the page in my browser. Here is my code:

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())

I am trying to get article titles from CNN, but this is my first issue. Thanks for the help!

Update: It seems that I know even less than I had initially assumed. My real question is: how do I extract titles from the CNN homepage? I've tried both answers, but the HTML from requests does not contain title information. How can I get the title information shown in this screenshot of my browser?

[Screenshot: CNN article title with its accompanying HTML, side by side]

Upvotes: 0

Views: 1742

Answers (3)

Mrugesh Kadia

Reputation: 555

You can use Selenium with ChromeDriver to scrape https://cnn.com.

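import bs4 as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
# Selenium 4 takes the driver path via a Service object, not a positional argument.
driver = webdriver.Chrome(service=Service("---CHROMEDRIVER-PATH---"), options=chrome_options)

driver.get('https://cnn.com/')
# The 'lxml' parser requires the lxml package to be installed.
soup = bs.BeautifulSoup(driver.page_source, 'lxml')

# Get the headline elements from the HTML.
titles = soup.find_all('span', {'class': 'cd__headline-text'})
print(titles)

# quit() closes every window and ends the ChromeDriver session.
driver.quit()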

Output:

[<span class="cd__headline-text"><strong>The West turned Aung San Suu Kyi into a saint. She was always going to disappoint </strong></span>, <span class="cd__headline-text"><strong>In Hindu-nationalist India, Muslims risk being branded infiltrators</strong></span>, <span class="cd__headline-text">Johnson may have stormed to victory, but he's got a problem</span>, <span class="cd__headline-text">Impeachment heads to full House after historic vote</span>, <span class="cd__headline-text">Supreme Court to decide on Trump's financial records</span>, <span class="cd__headline-text">Michelle Obama's message for Thunberg after Trump mocks her</span>, <span class="cd__headline-text">Actor Danny Aiello dies at 86</span>, <span class="cd__headline-text">The biggest risk at the North Pole isn't what you think</span>, <span class="cd__headline-text">US city declares state of emergency after cyberattack </span>, <span class="cd__headline-text">Reality TV show host arrested</span>, <span class="cd__headline-text">Big names in 2019 you may have mispronounced</span>, <span class="cd__headline-text"><strong>Morocco has Africa's 'first fully solar village'</strong></span>]
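The output above is a list of tag objects rather than plain strings. With BeautifulSoup, calling get_text(strip=True) on each element should give just the text. The sketch below shows the same idea using only the standard library's html.parser instead of BeautifulSoup, so it runs without extra installs. It assumes the cd__headline-text class from this answer still marks headlines (the live site's markup may have changed); HeadlineParser and SAMPLE are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text inside <span class="cd__headline-text"> elements."""

    def __init__(self, css_class="cd__headline-text"):
        super().__init__()
        self.css_class = css_class  # assumed class name, taken from the answer
        self.depth = 0              # > 0 while inside a matching span
        self.parts = []             # text fragments of the current headline
        self.headlines = []         # finished headline strings

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested spans too, so the matching close tag is found correctly.
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.headlines.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# Two elements copied from the output above, used as offline sample input.
SAMPLE = (
    '<span class="cd__headline-text"><strong>Morocco has Africa\'s '
    "'first fully solar village'</strong></span>"
    '<span class="cd__headline-text">Actor Danny Aiello dies at 86</span>'
)

parser = HeadlineParser()
parser.feed(SAMPLE)
print(parser.headlines)
```

To use it on the live page, feed it driver.page_source in place of SAMPLE.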

You will need to download ChromeDriver separately and point the path above at it.

Upvotes: 2

Debdut Goswami

Reputation: 1379

I tried the following code and it worked for me.

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
# Send a browser-like User-Agent instead of the default one from requests.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'
}
r = requests.get(base_url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())

Note that I passed a headers parameter to requests.get(). It sets a User-Agent that mimics a real browser, which makes the request less likely to be flagged by anti-scraping checks.
Hope this helps. If not, feel free to ask me in the comments. Cheers :)

Upvotes: 1

petezurich

Reputation: 10174

I just checked. CNN seems to recognize that you are scraping the site programmatically and serves a 404 (missing page) with no content on it instead of the homepage.
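One way to confirm this from the requests side is to look at the status code and check whether the headline markup is present at all. This is a minimal sketch, assuming the cd__headline-text class from the ChromeDriver answer still marks headlines; diagnose and fetch_and_diagnose are hypothetical helper names, not part of requests.

```python
def diagnose(status_code, html, marker="cd__headline-text"):
    """Classify a fetched page as 'error', 'no-headlines', or 'ok'."""
    if status_code >= 400:
        # The server answered with an error page (e.g. the 404 described above).
        return "error"
    if marker not in html:
        # The page loaded, but the headline markup is absent from the raw HTML.
        return "no-headlines"
    return "ok"


def fetch_and_diagnose(url):
    """Fetch a URL with requests and classify the response."""
    # Imported here so diagnose() can be tried without requests installed.
    import requests

    r = requests.get(url, timeout=10)
    return diagnose(r.status_code, r.text)


# Offline checks with hand-written inputs, no network needed.
print(diagnose(404, ""))
print(diagnose(200, "<html><body></body></html>"))
print(diagnose(200, '<span class="cd__headline-text">x</span>'))
```

A result of 'error' or 'no-headlines' from fetch_and_diagnose('https://www.cnn.com/') would suggest that plain requests is not enough for this page and that a real browser is needed.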

Try driving a real browser with Selenium (which can also run headless), e.g. like so:

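from selenium import webdriver

driver = webdriver.Firefox()  # requires Firefox to be installed
driver.get('https://cnn.com')
html = driver.page_source  # HTML of the page as loaded in the browser
driver.quit()  # close the browser once the HTML is captured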

Upvotes: 0
