Jack Brand

Reputation: 11

Python: Simple Web Crawler using BeautifulSoup4

I have been following TheNewBoston's Python 3.4 tutorials that use PyCharm, and I am currently on the tutorial on how to create a web crawler. I simply want to download all of XKCD's comics. Using the archive, that seemed very easy. Here is my code, followed by TheNewBoston's. Whenever I run the code, nothing happens; it runs through and says "Process finished with exit code 0". Where did I screw up?
TheNewBoston's tutorial is a little dated, and the website used for the crawl has changed domains. I will note the part of the video that seems to matter.

My code:

import requests
from urllib import request
from bs4 import BeautifulSoup

def download_img(image_url, page):
    name = str(page) + ".jpg"
    request.urlretrieve(image_url, name)


def xkcd_spirder(max_pages):
    page = 1
    while page <= max_pages:
        url = r'http://xkcd.com/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('div', {'img': 'src'}):
            href = link.get('href')
            print(href)
            download_img(href, page)
        page += 1

xkcd_spirder(5)

Upvotes: 1

Views: 889

Answers (1)

Padraic Cunningham

Reputation: 180401

Your soup.findAll('div', {'img': 'src'}) looks for div tags that have an attribute named img with the value "src"; no such tags exist, so the loop body never runs and the script exits without doing anything. The comic is in the div with the id comic. You just need to pull the src from the img inside that div, join it to the base url, and finally request the content and write it out; I use the basename as the name to save the file under.
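
As a quick illustration of that selector, this sketch shows what select_one("#comic img") matches; the markup here is a simplified stand-in for an xkcd comic page, not the live HTML:

from bs4 import BeautifulSoup

# simplified stand-in for the markup of a comic page (assumed structure)
html = '<div id="comic"><img src="//imgs.xkcd.com/comics/barrel_cropped_(1).jpg"/></div>'
soup = BeautifulSoup(html, "html.parser")
img = soup.select_one("#comic img")  # the img inside the div with id comic
print(img["src"])  # //imgs.xkcd.com/comics/barrel_cropped_(1).jpg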

I also replaced your while loop with a range loop and did all the HTTP requests using requests alone:

import requests
from bs4 import BeautifulSoup
from os import path
from urllib.parse import urljoin # python2 -> from urlparse import urljoin 


def download_img(image_url, base):
    # path.basename(image_url) turns e.g.
    # http://imgs.xkcd.com/comics/tree_cropped_(1).jpg into tree_cropped_(1).jpg
    with open(path.basename(image_url), "wb") as f:
        # image_url is a relative path, so we have to join it to the base url
        f.write(requests.get(urljoin(base, image_url)).content)


def xkcd_spirder(max_pages):
    base = "http://xkcd.com/"
    for page in range(1, max_pages + 1):
        url = base + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        # we only want one image
        img = soup.select_one("#comic img")  # or soup.find('div', id='comic').img
        download_img(img["src"], base)

xkcd_spirder(5)

Once you run the code you will see we get the first five comics.
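
Note that a few comics are interactive and may not have a plain img inside the comic div, in which case select_one returns None and img["src"] would raise a TypeError. A minimal guard, if you want one, might look like:

        img = soup.select_one("#comic img")
        if img is not None:  # skip pages without a plain <img>, e.g. interactive comics
            download_img(img["src"], base)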

Upvotes: 1
