Reputation: 179
I want to scrape all the URLs from multiple web pages. It works, but only the results from the last web page are saved in the file.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls = ['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2', '...']

for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")

    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
        links.append(link.get('href'))

    filename = 'output.csv'
    with open(filename, mode="w") as outfile:
        for s in links:
            outfile.write("%s\n" % s)
What am I missing here?
It would even be cooler if I could use a csv file with all the urls instead of the list. But anything I tried was way off...
Upvotes: 0
Views: 327
Reputation: 811
Hey, this is my first answer, so I'll try my best to help.
The data gets overwritten because you iterate through your urls in one loop and then iterate through the soup object in a separate second loop. By the time that second loop runs, soup only holds the last page, so the best thing to do is either append each soup object to a list from within the url loop, or query the soup object while still inside the url loop:
soup_obj_list = []
for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    soup_obj_list.append(soup)
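Continuing this approach, the collected soups can then be queried once, after the url loop, so no page's links are lost. A minimal sketch (using small inline HTML strings as stand-ins for the fetched pages, and reusing the asker's regex):

```python
import re
from bs4 import BeautifulSoup

# Stand-in pages; in the real script these would come from urlopen(req).read()
pages = [
    '<html><body><a href="/movie/aquaman">Aquaman</a></body></html>',
    '<html><body><a href="/movie/bumblebee">Bumblebee</a></body></html>',
]

soup_obj_list = [BeautifulSoup(p, features="html.parser") for p in pages]

# One links list, built across ALL soups instead of being reset per page
links = []
for soup in soup_obj_list:
    for link in soup.findAll('a', attrs={'href': re.compile(r"^/movie/([a-zA-Z0-9\-])+$")}):
        links.append(link.get('href'))

print(links)
```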
Hope that solves your first problem. I can't really help with the csv issue.
Upvotes: 1
Reputation: 1123
You are only using the last soup of your urls. You should move your second for loop inside the first one. Also, you are currently getting every element matching your regex, and there are matching elements outside of the table you are trying to scrape.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls = ['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2']

links = []
for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    # You should get only movies from the list, otherwise you will also append
    # the coming soon section. That is why we added select_one.
    for link in soup.select_one('ol.list_products').findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
        links.append(link.get('href'))

filename = 'output.csv'
with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)
Here is the result.
/movie/woman-at-war
/movie/destroyer
/movie/aquaman
/movie/bumblebee
/movie/between-worlds
/movie/american-renegades
/movie/mortal-engines
/movie/spider-man-into-the-spider-verse
/movie/the-quake
/movie/once-upon-a-deadpool
/movie/all-the-devils-men
/movie/dead-in-a-week-or-your-money-back
/movie/blood-brother-2018
/movie/ghostbox-cowboy
/movie/robin-hood-2018
/movie/creed-ii
/movie/outlaw-king
/movie/overlord-2018
/movie/the-girl-in-the-spiders-web
/movie/johnny-english-strikes-again
/movie/hunter-killer
/movie/bullitt-county
/movie/the-night-comes-for-us
/movie/galveston
/movie/the-oath-2018
/movie/mfkz
/movie/viking-destiny
/movie/loving-pablo
/movie/ride-2018
/movie/venom-2018
/movie/sicario-2-soldado
/movie/black-water
/movie/jurassic-world-fallen-kingdom
/movie/china-salesman
/movie/incredibles-2
/movie/superfly
/movie/believer
/movie/oceans-8
/movie/hotel-artemis
/movie/211
/movie/upgrade
/movie/adrift-2018
/movie/action-point
/movie/solo-a-star-wars-story
/movie/feral
/movie/show-dogs
/movie/deadpool-2
/movie/breaking-in
/movie/revenge
/movie/manhunt
/movie/avengers-infinity-war
/movie/supercon
/movie/love-bananas
/movie/rampage
/movie/ready-player-one
/movie/pacific-rim-uprising
/movie/tomb-raider
/movie/gringo
/movie/the-hurricane-heist
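For the second part of the question (reading the urls from a csv file instead of hard-coding the list), here is a minimal sketch using Python's standard csv module. It assumes a hypothetical file named urls.csv with one url per row in the first column; the sample file is created here only for demonstration:

```python
import csv

# Create a small sample file for demonstration (the asker would already have one).
with open('urls.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['https://www.metacritic.com/browse/movies/genre/date?page=0'])
    writer.writerow(['https://www.metacritic.com/browse/movies/genre/date?page=1'])

# Read the urls back; each non-empty row's first column is one url.
with open('urls.csv', newline='') as f:
    urls = [row[0] for row in csv.reader(f) if row]

print(urls)
```

The resulting urls list can then be dropped straight into the for url in urls loop above.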
Upvotes: 1