TAN-C-F-OK

Reputation: 179

Scraping multiple web pages, but the results are overwritten by the last URL

I want to scrape all the URLs from multiple web pages. It works, but only the results from the last web page are saved in the file.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls=['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2', '...']

for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")

links = []
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
    links.append(link.get('href'))

filename = 'output.csv'

with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)

What am I missing here?

It would be even cooler if I could read the URLs from a CSV file instead of the list, but everything I tried was way off...

Upvotes: 0

Views: 327

Answers (2)

Brandon Bailey

Reputation: 811

Hey, this is my first answer, so I'll try my best to help.

The overwrite happens because you fetch the pages in one loop and only extract the links in a separate loop that runs afterwards. Each pass through the first loop reassigns soup, so by the time the second loop runs, soup holds only the last page.

The fix is either to append each soup object to a list inside the URL loop, or to query the soup object directly inside the URL loop:

soup_obj_list = []
for url in urls:
    # Download each page once, with a browser-like User-Agent
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    # Keep every page's soup instead of overwriting the previous one
    soup_obj_list.append(soup)
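
Once every page has its own entry in soup_obj_list, the link extraction still has to loop over that list. Here is a minimal sketch of that second step. To keep it runnable without network access, the two HTML strings are made-up stand-ins for downloaded pages, not real Metacritic markup:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-ins for two downloaded pages (assumption, not real markup)
pages = [
    '<a href="/movie/aquaman">Aquaman</a><a href="/about">About</a>',
    '<a href="/movie/creed-ii">Creed II</a>',
]

# Same idea as soup_obj_list: one parsed soup per page
soup_obj_list = [BeautifulSoup(html, features="html.parser") for html in pages]

movie_href = re.compile(r"^/movie/[a-zA-Z0-9\-]+$")

links = []
for soup in soup_obj_list:
    # Query each page's soup, so no page's links are lost
    for link in soup.find_all("a", attrs={"href": movie_href}):
        links.append(link.get("href"))

print(links)
```

With real pages, the list comprehension would be replaced by the download loop; the nested extraction loop stays the same.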

Hope that solves your first problem. I can't really help with the CSV issue.
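
For the CSV part of the question, one possible approach is Python's built-in csv module. The sketch below assumes a file with one URL per row in the first column and no header row. The helper name read_urls, the file name urls.csv, and the example.com URLs are all placeholders for illustration:

```python
import csv
import os
import tempfile

def read_urls(path):
    """Return the first column of every non-empty row in a CSV file."""
    with open(path, newline="", encoding="utf-8") as infile:
        return [row[0].strip() for row in csv.reader(infile)
                if row and row[0].strip()]

# Demo with a throwaway file; a real urls.csv would be created by hand
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "urls.csv")
with open(csv_path, mode="w", newline="", encoding="utf-8") as outfile:
    outfile.write("https://example.com/page0\n\nhttps://example.com/page1\n")

urls = read_urls(csv_path)
print(urls)
```

The resulting urls list could then be used in place of the hard-coded list in the question. If the file has a header row, it would need to be skipped first.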

Upvotes: 1

Selçuk

Reputation: 1123

You are only using the soup from the last of your URLs. Move your second for loop inside the first one. Also, your regex matches every movie link on the page, and some of those elements are outside the list you are trying to scrape.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

urls=['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2']

links = []
for url in urls:
    # Download each page once, with a browser-like User-Agent
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    # select_one limits the search to the main movie list; without it the
    # links from the "coming soon" section would be appended as well.
    for link in soup.select_one('ol.list_products').find_all('a', attrs={'href': re.compile(r"^/movie/([a-zA-Z0-9\-])+$")}):
        links.append(link.get('href'))


filename = 'output.csv'

with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)

Here is the result.

/movie/woman-at-war
/movie/destroyer
/movie/aquaman
/movie/bumblebee
/movie/between-worlds
/movie/american-renegades
/movie/mortal-engines
/movie/spider-man-into-the-spider-verse
/movie/the-quake
/movie/once-upon-a-deadpool
/movie/all-the-devils-men
/movie/dead-in-a-week-or-your-money-back
/movie/blood-brother-2018
/movie/ghostbox-cowboy
/movie/robin-hood-2018
/movie/creed-ii
/movie/outlaw-king
/movie/overlord-2018
/movie/the-girl-in-the-spiders-web
/movie/johnny-english-strikes-again
/movie/hunter-killer
/movie/bullitt-county
/movie/the-night-comes-for-us
/movie/galveston
/movie/the-oath-2018
/movie/mfkz
/movie/viking-destiny
/movie/loving-pablo
/movie/ride-2018
/movie/venom-2018
/movie/sicario-2-soldado
/movie/black-water
/movie/jurassic-world-fallen-kingdom
/movie/china-salesman
/movie/incredibles-2
/movie/superfly
/movie/believer
/movie/oceans-8
/movie/hotel-artemis
/movie/211
/movie/upgrade
/movie/adrift-2018
/movie/action-point
/movie/solo-a-star-wars-story
/movie/feral
/movie/show-dogs
/movie/deadpool-2
/movie/breaking-in
/movie/revenge
/movie/manhunt
/movie/avengers-infinity-war
/movie/supercon
/movie/love-bananas
/movie/rampage
/movie/ready-player-one
/movie/pacific-rim-uprising
/movie/tomb-raider
/movie/gringo
/movie/the-hurricane-heist
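
One caveat with the approach above: select_one returns None when nothing matches the selector, so a page without ol.list_products (for example an error page) would raise an AttributeError. Below is a hedged sketch of a guard for that case. The function name extract_movie_links and the sample HTML are invented for illustration and only mimic the structure the answer relies on:

```python
import re
from bs4 import BeautifulSoup

MOVIE_HREF = re.compile(r"^/movie/[a-zA-Z0-9\-]+$")

def extract_movie_links(html, container_selector="ol.list_products"):
    """Return movie hrefs found inside the container, or [] if it is absent."""
    soup = BeautifulSoup(html, features="html.parser")
    container = soup.select_one(container_selector)
    if container is None:
        # Selector did not match (layout change, error page, ...)
        return []
    return [a.get("href")
            for a in container.find_all("a", attrs={"href": MOVIE_HREF})]

# Made-up markup: one link inside the list, one outside it
sample = ('<ol class="list_products"><li><a href="/movie/aquaman">A</a></li></ol>'
          '<div><a href="/movie/coming-soon-title">B</a></div>')

inside_only = extract_movie_links(sample)
missing_container = extract_movie_links("<p>no list here</p>")
print(inside_only, missing_container)
```

Inside the URL loop, links.extend(extract_movie_links(html_page)) would then take the place of the inner for loop.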

Upvotes: 1
