Vapiano

Reputation: 33

How to skip errors in web scraping with Python?

I am new to Python and would like to learn web scraping with it. My first project is the yellow pages in Germany.

When executing my code, I get the following IndexError after scraping 12 pages:

Traceback (most recent call last):
  File "C:/Users/Zorro/PycharmProjects/scraping/venv/Lib/site-packages/pip-19.0.3-py3.6.egg/pip/_vendor/pytoml/test.py", line 25
    city = city_container[0].text.strip()
IndexError: list index out of range

Process finished with exit code 1

I would like to know how I can skip this error so that Python does not stop scraping.

I tried to use try/except blocks, but did not succeed.

from bs4 import BeautifulSoup as soup
import requests


page_title = "/Seite-"
page_number = 1

for i in range(25):

    my_url = "https://www.gelbeseiten.de/Branchen/Italienisches%20Restaurant/Berlin"

    page_html = requests.get(my_url + page_title + str(page_number))
    page_soup = soup(page_html.text, "html.parser")

    containers = page_soup.findAll("div", {"class": "table"})

    for container in containers:
        name_container = container.findAll("div", {"class": "h2"})
        name = name_container[0].text.strip()

        street_container = container.findAll("span", {"itemprop": "streetAddress"})
        street = street_container[0].text.strip()

        city_container = container.findAll("span", {"itemprop": "addressLocality"})
        city = city_container[0].text.strip()

        plz_container = container.findAll("span", {"itemprop": "postalCode"})
        plz_name = plz_container[0].text.strip()

        tele_container = container.findAll("li", {"class": "phone"})
        tele = tele_container[0].text.strip()

        print(name, "\n" + street, "\n" + plz_name + " " + city, "\n" + tele)
        print()

    page_number += 1

Upvotes: 0

Views: 1071

Answers (1)

seulberg1

Reputation: 1013

Ok, the formatting seems to have suffered a little upon posting the code. Two things:

1) When web scraping, it is usually advisable to add some downtime between consecutive requests, so that you don't get thrown off the server and don't block too many resources. I added time.sleep(5) between page requests to wait 5 seconds before loading the next page.

2) For me, try/except worked just fine once you add pass to the except clause. Of course, you can get more sophisticated in how you treat exceptions.

from bs4 import BeautifulSoup as soup
import requests
import time


page_title = "/Seite-"
page_number = 1

for i in range(25):
    print(page_number)
    time.sleep(5)
    my_url = "https://www.gelbeseiten.de/Branchen/Italienisches%20Restaurant/Berlin"

    page_html = requests.get(my_url + page_title + str(page_number))
    page_soup = soup(page_html.text, "html.parser")

    containers = page_soup.findAll("div", {"class": "table"})

    for container in containers:

        try:
            name_container = container.findAll("div", {"class": "h2"})
            name = name_container[0].text.strip()

            street_container = container.findAll("span", {"itemprop": "streetAddress"})
            street = street_container[0].text.strip()

            city_container = container.findAll("span", {"itemprop": "addressLocality"})
            city = city_container[0].text.strip()

            plz_container = container.findAll("span", {"itemprop": "postalCode"})
            plz_name = plz_container[0].text.strip()

            tele_container = container.findAll("li", {"class": "phone"})
            tele = tele_container[0].text.strip()

            print(name, "\n" + street, "\n" + plz_name + " " + city, "\n" + tele)
            print()

        except:
            # skip any listing that is missing one of the fields
            pass

    page_number += 1
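As a more targeted alternative to wrapping the whole loop body in try/except, you can guard each lookup individually, so one missing field (e.g. no phone number) doesn't discard the rest of the listing. The sketch below is my own suggestion, not from the post: the `safe_text` helper and the sample HTML are made up for illustration, and it uses `find` (the single-element counterpart of `findAll`) so there is no list to index into.

```python
from bs4 import BeautifulSoup


def safe_text(container, tag, attrs, default="n/a"):
    """Return the stripped text of the first matching element, or a default."""
    element = container.find(tag, attrs)
    return element.text.strip() if element is not None else default


# Minimal sample listing: it has a name and street, but no postal code.
html = """
<div class="table">
  <div class="h2"> Trattoria Beispiel </div>
  <span itemprop="streetAddress">Musterstr. 1</span>
</div>
"""
container = BeautifulSoup(html, "html.parser").find("div", {"class": "table"})

print(safe_text(container, "div", {"class": "h2"}))              # Trattoria Beispiel
print(safe_text(container, "span", {"itemprop": "postalCode"}))  # n/a
```

This way the scraper keeps running and you can still see which fields were missing, instead of silently dropping the whole entry.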

Upvotes: 2
