
Reputation: 179

Scrape URLs using BeautifulSoup in Python 3

I tried this code but the list with the URLs stays empty. No error message, nothing.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, features="xml")
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^https://www.metacritic.com/movie/")}):
    links.append(link.get('href'))

print(links)

I want to scrape all the URLs that start with "https://www.metacritic.com/movie/" that are found in the given URL "https://www.metacritic.com/browse/movies/genre/date?page=0".

What am I doing wrong?

Upvotes: 2

Views: 11543

Answers (2)

leiropi

Reputation: 464

First, you should use the standard library parser "html.parser" instead of "xml" for parsing the page content. It deals better with broken HTML (see Beautiful Soup findAll doesn't find them all).

Then take a look at the source code of the page you are parsing. The elements you want to find look like this: <a href="/movie/woman-at-war"> — note that the href attributes are relative paths, not absolute URLs, which is why your pattern never matches.

So change your code like this:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, 'html.parser')
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/")}):
    links.append(link.get('href'))

print(links)
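If you still want the full URLs starting with "https://www.metacritic.com/movie/" (as the question asked), you can resolve the relative hrefs against the page URL with urllib.parse.urljoin. A small sketch — the movie slugs below are just example values standing in for the scraped hrefs:

```python
from urllib.parse import urljoin

base = 'https://www.metacritic.com/browse/movies/genre/date?page=0'

# Example relative hrefs as they appear in the page's anchor tags
relative_links = ['/movie/woman-at-war', '/movie/the-favourite']

# urljoin resolves each relative path against the base page URL,
# producing absolute URLs like https://www.metacritic.com/movie/...
absolute_links = [urljoin(base, href) for href in relative_links]
print(absolute_links)
```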

Upvotes: 6

gosuto

Reputation: 5741

Your code is sound.

The list stays empty because there aren't any URLs on that page matching that pattern. Try re.compile("^/movie/") instead.

Upvotes: 2
