Reputation: 179
I want to scrape all the URLs from multiple web pages. It works, but only the results from the last web page are saved in the file.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls = ['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2', '...']

for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")

    links = []
    for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
        links.append(link.get('href'))

    filename = 'output.csv'
    with open(filename, mode="w") as outfile:
        for s in links:
            outfile.write("%s\n" % s)
What am I missing here?
It would even be cooler if I could use a csv file with all the urls instead of the list. But anything I tried was way off...
Upvotes: 0
Views: 327
Reputation: 811
Hey, this is my first answer, so I'll try my best to help.
The data gets overwritten because you iterate through your urls in one loop and then iterate through the soup object in a separate second loop. By the time that second loop runs, soup only holds the last page, so the best thing to do is either append each soup object to a list from within the url loop, or query the soup object while still inside the url loop:
soup_obj_list = []
for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    soup_obj_list.append(soup)
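Continuing this approach, the collected soups can then be queried once, after the url loop, so no page's links are lost. A minimal sketch (using small inline HTML strings as stand-ins for the fetched pages, and reusing the asker's regex):

```python
import re
from bs4 import BeautifulSoup

# Stand-in pages; in the real script these would come from urlopen(req).read()
pages = [
    '<html><body><a href="/movie/aquaman">Aquaman</a></body></html>',
    '<html><body><a href="/movie/bumblebee">Bumblebee</a></body></html>',
]

soup_obj_list = [BeautifulSoup(p, features="html.parser") for p in pages]

# One links list, built across ALL soups instead of being reset per page
links = []
for soup in soup_obj_list:
    for link in soup.findAll('a', attrs={'href': re.compile(r"^/movie/([a-zA-Z0-9\-])+$")}):
        links.append(link.get('href'))

print(links)
```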
Hope that solves your first problem. I can't really help with the csv issue.
Upvotes: 1
Reputation: 1123
You are only using the last soup of your urls. You should move your second for loop inside the first one. Also, you are currently getting every element matching your regex, and there are matching elements outside of the table you are trying to scrape.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls = ['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2']

links = []
for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")
    # You should get only movies from the list, otherwise you will also append
    # the coming soon section. That is why we added select_one.
    for link in soup.select_one('ol.list_products').findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
        links.append(link.get('href'))

filename = 'output.csv'
with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)
Here is the result.
/movie/woman-at-war
/movie/destroyer
/movie/aquaman
/movie/bumblebee
/movie/between-worlds
/movie/american-renegades
/movie/mortal-engines
/movie/spider-man-into-the-spider-verse
/movie/the-quake
/movie/once-upon-a-deadpool
/movie/all-the-devils-men
/movie/dead-in-a-week-or-your-money-back
/movie/blood-brother-2018
/movie/ghostbox-cowboy
/movie/robin-hood-2018
/movie/creed-ii
/movie/outlaw-king
/movie/overlord-2018
/movie/the-girl-in-the-spiders-web
/movie/johnny-english-strikes-again
/movie/hunter-killer
/movie/bullitt-county
/movie/the-night-comes-for-us
/movie/galveston
/movie/the-oath-2018
/movie/mfkz
/movie/viking-destiny
/movie/loving-pablo
/movie/ride-2018
/movie/venom-2018
/movie/sicario-2-soldado
/movie/black-water
/movie/jurassic-world-fallen-kingdom
/movie/china-salesman
/movie/incredibles-2
/movie/superfly
/movie/believer
/movie/oceans-8
/movie/hotel-artemis
/movie/211
/movie/upgrade
/movie/adrift-2018
/movie/action-point
/movie/solo-a-star-wars-story
/movie/feral
/movie/show-dogs
/movie/deadpool-2
/movie/breaking-in
/movie/revenge
/movie/manhunt
/movie/avengers-infinity-war
/movie/supercon
/movie/love-bananas
/movie/rampage
/movie/ready-player-one
/movie/pacific-rim-uprising
/movie/tomb-raider
/movie/gringo
/movie/the-hurricane-heist
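For the second part of the question (reading the urls from a csv file instead of hard-coding the list), here is a minimal sketch using Python's standard csv module. It assumes a hypothetical file named urls.csv with one url per row in the first column; the sample file is created here only for demonstration:

```python
import csv

# Create a small sample file for demonstration (the asker would already have one).
with open('urls.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['https://www.metacritic.com/browse/movies/genre/date?page=0'])
    writer.writerow(['https://www.metacritic.com/browse/movies/genre/date?page=1'])

# Read the urls back; each non-empty row's first column is one url.
with open('urls.csv', newline='') as f:
    urls = [row[0] for row in csv.reader(f) if row]

print(urls)
```

The resulting urls list can then be dropped straight into the for url in urls loop above.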
Upvotes: 1