Reputation: 11
I have been following TheNewBoston's Python 3.4 tutorials that use Pycharm, and am currently on the tutorial on how to create a web crawler. I Simply want to download all of XKCD's Comics. Using the archive that seemed very easy. Here is my code, followed by TheNewBoston's.
Whenever I run the code, nothing happens. It runs through and says, "Process finished with exit code 0" Where did I screw up?
TheNewBoston's Tutorial is a little dated, and the website used for the crawl has changed domains. I will comment the part of the video that seems to matter.
My code:
mport requests
from urllib import request
from bs4 import BeautifulSoup
def download_img(image_url, page):
name = str(page) + ".jpg"
request.urlretrieve(image_url, name)
def xkcd_spirder(max_pages):
page = 1
while page <= max_pages:
url = r'http://xkcd.com/' + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.findAll('div', {'img': 'src'}):
href = link.get('href')
print(href)
download_img(href, page)
page += 1
xkcd_spirder(5)
Upvotes: 1
Views: 889
Reputation: 180401
The comic is in the div with the id comic, you just need to pull the src from img inside that div then join it to the base url and finally request the content and write, I use the basename as the name to save the file under.
I also replaced your while with a range loop and did all the http requests just using requests:
import requests
from bs4 import BeautifulSoup
from os import path
from urllib.parse import urljoin # python2 -> from urlparse import urljoin
def download_img(image_url, base):
# path.basename(image_url)
# http://imgs.xkcd.com/comics/tree_cropped_(1).jpg -> tree_cropped_(1).jpg -
with open(path.basename(image_url), "wb") as f:
# image_url is a releative path, we have to join to the base
f.write(requests.get(urljoin(base,image_url)).content)
def xkcd_spirder(max_pages):
base = "http://xkcd.com/"
for page in range(1, max_pages + 1):
url = base + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
# we only want one image
img = soup.select_one("#comic img") # or .find('div',id= 'comic').img
download_img(img["src"], base)
xkcd_spirder(5)
Once you run the code you will see we get the first five comics.
Upvotes: 1