user12388818

Simple Web Scraper isn't printing anything. What is the problem?

import requests
from bs4 import BeautifulSoup as bs

results = requests.get("https://www.cnn.com")
src = results.content
soup = bs(src, 'lxml')

urls = []

for h3_tag in soup.find_all("h3"):
    a_tag = h3_tag.find("a")
    urls.append(a_tag.attrs["href"])

for url in urls:
    print(url + "\n")
print(urls)

For some reason my program prints an empty list, and I can't figure out what the problem is. I'm pretty sure the error is in the first for loop, but I'm not sure.

Upvotes: 2

Views: 102

Answers (1)

Muon

Reputation: 1346

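requests only downloads the initial HTML that the server sends; it doesn't run any JavaScript. A lot of the elements on this page, including the h3 tags, are rendered by JavaScript after the page loads in a browser, so they aren't in the response you're parsing and find_all("h3") returns nothing. You can get around this with web browser automation (like Selenium), which loads the page in a real browser and gives you the rendered HTML.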
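A quick way to see the difference is to count h3 tags in static HTML, which is all requests ever gives you. The sketch below uses only the standard library's html.parser so it runs without extra installs; the two sample pages and the names TagCounter and count_tags are made up for illustration and are not CNN's actual markup.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags with a given name in static HTML."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.count += 1


def count_tags(html, tag_name):
    """Return how many <tag_name> elements appear in the raw HTML."""
    counter = TagCounter(tag_name)
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical page: the headline exists only as a string inside a script,
# so it becomes an element only after a browser runs the JavaScript.
js_rendered = """
<html><body>
<div id="app"></div>
<script>
  document.getElementById("app").innerHTML = "<h3>Story</h3>";
</script>
</body></html>
"""

# Hypothetical page with the headline already present in the HTML.
server_rendered = '<html><body><h3><a href="/story">Story</a></h3></body></html>'

print(count_tags(js_rendered, "h3"))      # 0
print(count_tags(server_rendered, "h3"))  # 1
```

If the same check on results.text from your own script gives 0, the tags are being added by JavaScript and you need a browser to render them.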

The Selenium example below uses Mozilla's geckodriver, which you can download from its releases page.

from bs4 import BeautifulSoup as bs
from selenium import webdriver

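# load the driver (adjust the path to wherever geckodriver lives on your machine)
# note: newer Selenium 4 releases no longer accept executable_path and expect a Service object instead
driver = webdriver.Firefox(executable_path='Development/webdrivers/geckodriver')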

# get the content and pass to BS
driver.get('https://www.cnn.com')
html = driver.page_source
soup = bs(html, 'lxml')

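# get links (simplified using list comprehension);
# h3 tags without an <a href=...> inside are skipped instead of raising an error
links = (h3_tag.find("a", href=True) for h3_tag in soup.find_all("h3"))
urls = [a_tag["href"] for a_tag in links if a_tag]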

# result
print(urls)

# quit the driver (closes the browser and ends the geckodriver session)
driver.quit()

Output

['/2019/11/18/politics/ukraine-zelensky-pressure-trump-investigations/index.html',
 '/2019/11/18/politics/house-investigating-trump-lying-to-mueller/index.html',
 '/2019/11/18/politics/trump-tax-documents-supreme-court/index.html',
 '/2019/11/18/politics/house-ways-means-irs-whistleblower/index.html',
 '/videos/politics/2019/11/18/trump-walter-reed-visit-jonathan-reiner-nr-vpx.cnn',
 '/2019/11/18/asia/hong-kong-poly-university-protest-police-intl-hnk/index.html',
 '/2019/11/18/asia/south-china-sea-intl-hnk/index.html',
 '/2019/11/18/politics/pompeo-west-bank-settlements-announcement/index.html',
 '/2019/11/18/uk/prince-andrew-has-thrown-a-fireblanket-over-the-brexit-election-intl-ge19-gbr/index.html',
 '/2019/11/18/uk/jennifer-arcuri-boris-johnson-interview-ge19-gbr-intl/index.html',
 '/2019/11/18/asia/north-korea-us-meeting-intl/index.html',
 '/2019/11/18/africa/france-returns-stolen-sword-to-senegal/index.html',
 '/2019/11/18/us/fresno-mass-shooting-football-party/index.html',
 '/2019/11/18/football/ahmad-mendes-moreira-racist-abuse-fc-den-bosch-excelsior-spt-intl/index.html',
 '/2019/11/18/health/china-bubonic-plague-intl-hnk-scn-scli/index.html',
 '/travel/article/will-i-am-qantas-racism-row-intl-scli/index.html',
 '/2019/11/18/uk/blind-student-oxford-union-scli-intl-gbr/index.html',
 '/2019/11/18/middleeast/iran-protests-explained-intl/index.html',
 '/2019/11/18/business/coty-kylie-cosmetics-deal/index.html',
 '/2019/11/18/sport/israel-folau-bushfires-intl-spt/index.html',
 '/2019/11/18/business/airbus-emirates-dubai-air-show/index.html',
 '/2019/11/18/us/oklahoma-walmart-shooting/index.html',
 '/2019/11/18/us/minnesota-twins-prospect-ryan-costello-dead-trnd/index.html',
 '/travel/article/unruly-airplane-passengers/index.html',
 '/style/article/china-beijing-silvermine-negatives/index.html',
 '/2019/11/18/world/bizarre-basking-shark-scn-trnd/index.html',
 '/style/article/banksy-drinker-sale-intl-scli/index.html',
 '/2019/11/18/business/marie-kondo-online-shop/index.html',
 '/2019/11/18/tennis/tsitsipas-atp-finals-tennis-spt-intl/index.html',
 '/2019/11/18/health/samoa-measles-emergency-intl-scli/index.html',
 '/2019/11/18/africa/bogaletch-gebre-obit-trnd/index.html',
 '/style/article/the-crown-royal-fashion/index.html',
 '/travel/article/thailand-bullet-trains/index.html',
 '/2019/11/18/entertainment/prince-philips-mother-princess-alice-interesting-facts-intl-scli/index.html',
 '/2019/11/18/opinions/trump-assault-weapons-export-abramson/index.html',
 '/2019/11/17/opinions/donald-trump-magic-evaporating-campaign-trail-obeidallah/index.html',
 '/2019/11/16/opinions/this-is-life-counterterrorism-la-bau/index.html',
 '/2019/11/18/perspectives/andrew-yang-technology/index.html']
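The hrefs in that output are relative to the site root, so you'll probably want to turn them into full URLs before requesting them. A minimal sketch using urllib.parse.urljoin from the standard library; the helper name to_absolute and the None entry (standing in for an h3 with no link) are assumptions for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.cnn.com"  # the page the links were scraped from


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url, skipping missing values."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Two hrefs copied from the output above, plus a None to stand in for
# an h3 without a link (an assumption for illustration).
sample = [
    "/2019/11/18/asia/south-china-sea-intl-hnk/index.html",
    "/travel/article/thailand-bullet-trains/index.html",
    None,
]

for url in to_absolute(BASE_URL, sample):
    print(url)
# https://www.cnn.com/2019/11/18/asia/south-china-sea-intl-hnk/index.html
# https://www.cnn.com/travel/article/thailand-bullet-trains/index.html
```

urljoin also leaves hrefs that are already absolute untouched, so it's safe to run over a mixed list.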

Upvotes: 2
