
Reputation: 179

Scrape URLs using BeautifulSoup in Python 3

I tried this code but the list with the URLs stays empty. No error message, nothing.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, features="xml")
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^https://www.metacritic.com/movie/")}):
    links.append(link.get('href'))

print(links)

I want to scrape all the URLs that start with "https://www.metacritic.com/movie/" that are found in the given URL "https://www.metacritic.com/browse/movies/genre/date?page=0".

What am I doing wrong?

Upvotes: 2

Views: 11543

Answers (2)

leiropi

Reputation: 464

First, you should use the standard library parser "html.parser" instead of "xml" for parsing the page content. It deals better with broken HTML (see Beautiful Soup findAll doesn't find them all).

Then take a look at the source code of the page you are parsing. The elements you want to find look like this: <a href="/movie/woman-at-war"> — note that the href attributes are relative paths, not absolute URLs, which is why your pattern never matches.

So change your code like this:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, 'html.parser')
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/")}):
    links.append(link.get('href'))

print(links)
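If you still want the full URLs starting with "https://www.metacritic.com/movie/" (as the question asked), you can resolve the relative hrefs against the page URL with urllib.parse.urljoin. A small sketch — the movie slugs below are just example values standing in for the scraped hrefs:

```python
from urllib.parse import urljoin

base = 'https://www.metacritic.com/browse/movies/genre/date?page=0'

# Example relative hrefs as they appear in the page's anchor tags
relative_links = ['/movie/woman-at-war', '/movie/the-favourite']

# urljoin resolves each relative path against the base page URL,
# producing absolute URLs like https://www.metacritic.com/movie/...
absolute_links = [urljoin(base, href) for href in relative_links]
print(absolute_links)
```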

Upvotes: 6

gosuto

Reputation: 5741

Your code is sound.

The list stays empty because there aren't any URLs on that page matching that pattern. Try re.compile("^/movie/") instead.

Upvotes: 2
