Reputation: 179
I tried this code but the list with the URLs stays empty. No error message, nothing.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()
soup = BeautifulSoup(html_page, features="xml")
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^https://www.metacritic.com/movie/")}):
    links.append(link.get('href'))
print(links)
I want to scrape all the URLs that start with "https://www.metacritic.com/movie/" that are found in the given URL "https://www.metacritic.com/browse/movies/genre/date?page=0".
What am I doing wrong?
Upvotes: 2
Views: 11543
Reputation: 464
First, you should use the standard library's "html.parser" instead of "xml" for parsing the page content, since the page is HTML, not XML. It also deals better with broken HTML (see Beautiful Soup findAll doesn't find them all).
Then take a look at the source code of the page you are parsing. The elements you want to find look like this: <a href="/movie/woman-at-war"> — the href values are site-relative paths, not absolute URLs, so your pattern anchored on "https://" can never match them.
So change your code like this:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()
soup = BeautifulSoup(html_page, 'html.parser')
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/")}):
    links.append(link.get('href'))
print(links)
Upvotes: 6
Reputation: 5741
Your code is sound.
The list stays empty because there aren't any URLs on that page matching that pattern. Try re.compile("^/movie/") instead.
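You can check this for yourself by testing both patterns against a sample href (hypothetical values here, mirroring the site-relative paths the page source actually contains):

```python
import re

# hrefs as they appear in the page source: site-relative paths
hrefs = ['/movie/woman-at-war', '/browse/movies/genre/date?page=1']

absolute_pattern = re.compile(r'^https://www\.metacritic\.com/movie/')
relative_pattern = re.compile(r'^/movie/')

# the absolute pattern matches nothing; the relative one finds the movie link
print([h for h in hrefs if absolute_pattern.search(h)])  # []
print([h for h in hrefs if relative_pattern.search(h)])  # ['/movie/woman-at-war']
```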
Upvotes: 2