Reputation: 19
import requests
from bs4 import BeautifulSoup
LURL="https://www.erkunde-die-welt.de/laender-hauptstaedte-welt/"
Lpage = requests.get(LURL)
Lsoup = BeautifulSoup(Lpage.content, 'html.parser')
Lx = Lsoup.find_all(class_="column-2")
a=[]
for Lx in Lx:
    a.append(Lx.text)
a.remove("Land")
j=0
for i in range(len(a)):
    b = a[j]
    URL = "https://de.wikipedia.org/wiki/"+b
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    l = soup.find(class_="firstHeading")
    zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
    z = zr.findAll("tr")
    a=""
    for z in z:
        a=a+z.text
    h=a.find("Hauptstadt")
    lol=a[h:-1]
    lol=lol.replace("Hauptstadt", "")
    lol=lol.strip()
    fg=lol.find("\n")
    lol=lol[0:fg]
    lol=lol.strip()
    j=j+1
    print(lol)
    print(l.text)
This is the code. It gets the name of every country and packs it into a list. After that, the program loops through the Wikipedia pages of the countries, gets the capital of each country, and prints it. It works fine for the first country, but after that country is finished and the loop starts again, it stops working with this error:
Traceback (most recent call last):
  File "main.py", line 19, in <module>
    z = zr.findAll("tr")
AttributeError: 'NoneType' object has no attribute 'findAll'
Upvotes: 0
Views: 73
Reputation: 33
The 'NoneType' means that this line:
zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
has returned nothing.
The error is in this loop:
for Lx in Lx:
    a.append(Lx.text)
You can't reuse the same name there. Please try this loop instead and let me know how it goes:
for L in Lx:
    a.append(L.text)
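A minimal sketch of what goes wrong when a name is reused as its own loop variable (the list here is a made-up stand-in, not the scraped data):

```python
countries = ["Germany", "France"]

# The iterable is evaluated once, so the loop still runs to completion,
# but the name "countries" is rebound to each element in turn.
for countries in countries:
    pass

# After the loop, the original list is gone; the name now holds the
# last element, a plain string.
print(countries)  # -> France
```

Any later code that expects `countries` to still be a list will then fail, which is why a distinct loop variable matters even when the loop itself appears to work.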
Upvotes: 0
Reputation: 71562
You stored the list of countries in a variable called a, which you then overwrote later in the script with some other value. That messes up your iteration. Two good ways to prevent problems like this: use distinct, descriptive variable names, and run mypy on your Python code.
I spent a little time doing some basic cleanup on your code to at least get you past that first bug. The list of countries is now called countries instead of a, which prevents you from overwriting it, and I replaced the extremely confusing i/j/a/b iteration with a very simple for country in countries loop. I also got rid of all the variables that were only used once so I wouldn't have to come up with better names for them. I think there's more work to be done, but I don't have enough of an idea what that inner loop is doing to want to even try to fix it. Good luck!
import requests
from bs4 import BeautifulSoup

countries = [x.text for x in BeautifulSoup(
    requests.get(
        "https://www.erkunde-die-welt.de/laender-hauptstaedte-welt/"
    ).content,
    'html.parser'
).find_all(class_="column-2")]
countries.remove("Land")

for country in countries:
    soup = BeautifulSoup(
        requests.get(
            "https://de.wikipedia.org/wiki/" + country
        ).content,
        'html.parser'
    )
    heading = soup.find(class_="firstHeading")
    rows = soup.find(
        class_="wikitable infobox infoboxstaat float-right"
    ).findAll("tr")
    a = ""
    for row in rows:
        a += row.text
    h = a.find("Hauptstadt")
    lol = a[h:-1]
    lol = lol.replace("Hauptstadt", "")
    lol = lol.strip()
    fg = lol.find("\n")
    lol = lol[0:fg]
    lol = lol.strip()
    print(lol)
    print(heading.text)
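On the mypy suggestion above: a sketch of the kind of mistake it catches, assuming the file is checked with the mypy command-line tool. Reassigning a list-typed name to a string is exactly the overwrite that broke the original script:

```python
# mypy infers the type of "a" from its first assignment: list[str]
a = ["Germany", "France"]

# Rebinding it to a different type runs fine at runtime, but mypy flags it:
#   error: Incompatible types in assignment
#   (expression has type "str", variable has type "list[str]")
a = ""
```

The code executes without complaint, which is why the original bug was silent; a static checker is one of the few ways to catch it before the loop misbehaves.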
Upvotes: 1
Reputation: 3631
The error message is actually telling you what's happening. The line of code
z = zr.findAll("tr")
is throwing an AttributeError because a NoneType object does not have a findAll attribute. You are calling findAll on zr, assuming that variable will always be a BeautifulSoup object, but it won't be. If this line:
zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
finds no element in the HTML matching those classes, zr will be set to None. So that is what's happening on one of the pages you are trying to scrape. You can code around it with a try/except statement, like this:
for i in range(len(a)):
    b = a[j]
    URL = "https://de.wikipedia.org/wiki/"+b
    page = requests.get(URL)
    try:
        soup = BeautifulSoup(page.content, 'html.parser')
        l = soup.find(class_="firstHeading")
        zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
        z = zr.findAll("tr")
        a=""
        # don't do this! should be 'for row in z' or some other variable name
        for z in z:
            a=a+z.text
        h=a.find("Hauptstadt")
        lol=a[h:-1]
        lol=lol.replace("Hauptstadt", "")
        lol=lol.strip()
        fg=lol.find("\n")
        lol=lol[0:fg]
        lol=lol.strip()
        j=j+1
        print(lol)
        print(l.text)
    except:
        pass
In this example, any page that doesn't have the right html tags will be skipped.
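An alternative to the broad try/except is an explicit None check before using the result, which skips only the missing-infobox case instead of silencing every error. The same pattern works for any lookup that returns None on no match; here it is sketched with the standard library's re.search, which behaves like soup.find in that respect (the page strings are made-up stand-ins):

```python
import re

pages = ["Hauptstadt Berlin ...", "a page with no infobox"]
capitals = []
for text in pages:
    match = re.search(r"Hauptstadt (\w+)", text)
    if match is None:  # nothing matched: skip this page instead of crashing
        continue
    capitals.append(match.group(1))

print(capitals)  # -> ['Berlin']
```

In the scraper, the equivalent would be checking zr for None right after the soup.find call and continuing to the next country, so an unrelated bug (say, a typo in the parsing code) still surfaces instead of being swallowed by a bare except.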
Upvotes: 1