Reputation: 19
import requests
from bs4 import BeautifulSoup
LURL="https://www.erkunde-die-welt.de/laender-hauptstaedte-welt/"
Lpage = requests.get(LURL)
Lsoup = BeautifulSoup(Lpage.content, 'html.parser')
Lx = Lsoup.find_all(class_="column-2")
a=[]
for Lx in Lx:
    a.append(Lx.text)
a.remove("Land")
j=0
for i in range(len(a)):
    b = a[j]
    URL = "https://de.wikipedia.org/wiki/"+b
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    l = soup.find(class_="firstHeading")
    zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
    z = zr.findAll("tr")
    a=""
    for z in z:
        a=a+z.text
    h=a.find("Hauptstadt")
    lol=a[h:-1]
    lol=lol.replace("Hauptstadt", "")
    lol=lol.strip()
    fg=lol.find("\n")
    lol=lol[0:fg]
    lol=lol.strip()
    j=j+1
    print(lol)
    print(l.text)
This is the code. It gets the name of every country and packs it into a list. After that, the program loops through the Wikipedia pages of the countries, gets the capital of each country, and prints it. It works fine for the first country, but after that country is finished and the loop starts again, it stops working with this error:
Traceback (most recent call last):
  File "main.py", line 19, in <module>
    z = zr.findAll("tr")
AttributeError: 'NoneType' object has no attribute 'findAll'
Upvotes: 0
Views: 73
Reputation: 33
The 'NoneType' means that this line:
zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
has returned nothing.
The error is in this loop:
for Lx in Lx:
    a.append(Lx.text)
You can't reuse the same name there. Please try this loop instead and let me know how it goes:
for L in Lx:
    a.append(L.text)
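A minimal sketch of what goes wrong when a name is reused as its own loop variable (the list here is a made-up stand-in, not the scraped data):

```python
countries = ["Germany", "France"]

# The iterable is evaluated once, so the loop still runs to completion,
# but the name "countries" is rebound to each element in turn.
for countries in countries:
    pass

# After the loop, the original list is gone; the name now holds the
# last element, a plain string.
print(countries)  # -> France
```

Any later code that expects `countries` to still be a list will then fail, which is why a distinct loop variable matters even when the loop itself appears to work.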
Upvotes: 0
Reputation: 71562
You stored the list of countries in a variable called a, which you then overwrote later in the script with some other value. That messes up your iteration. Two good ways to prevent problems like this: use distinct, descriptive variable names, and run mypy on your Python code.
I spent a little time doing some basic cleanup on your code to at least get you past that first bug. The list of countries is now called countries instead of a, which prevents you from overwriting it, and I replaced the extremely confusing i/j/a/b iteration with a very simple for country in countries loop. I also got rid of all the variables that were only used once so I wouldn't have to come up with better names for them. I think there's more work to be done, but I don't have enough of an idea what that inner loop is doing to want to even try to fix it. Good luck!
import requests
from bs4 import BeautifulSoup

countries = [x.text for x in BeautifulSoup(
    requests.get(
        "https://www.erkunde-die-welt.de/laender-hauptstaedte-welt/"
    ).content,
    'html.parser'
).find_all(class_="column-2")]
countries.remove("Land")

for country in countries:
    soup = BeautifulSoup(
        requests.get(
            "https://de.wikipedia.org/wiki/" + country
        ).content,
        'html.parser'
    )
    heading = soup.find(class_="firstHeading")
    rows = soup.find(
        class_="wikitable infobox infoboxstaat float-right"
    ).findAll("tr")
    a = ""
    for row in rows:
        a += row.text
    h = a.find("Hauptstadt")
    lol = a[h:-1]
    lol = lol.replace("Hauptstadt", "")
    lol = lol.strip()
    fg = lol.find("\n")
    lol = lol[0:fg]
    lol = lol.strip()
    print(lol)
    print(heading.text)
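On the mypy suggestion above: a sketch of the kind of mistake it catches, assuming the file is checked with the mypy command-line tool. Reassigning a list-typed name to a string is exactly the overwrite that broke the original script:

```python
# mypy infers the type of "a" from its first assignment: list[str]
a = ["Germany", "France"]

# Rebinding it to a different type runs fine at runtime, but mypy flags it:
#   error: Incompatible types in assignment
#   (expression has type "str", variable has type "list[str]")
a = ""
```

The code executes without complaint, which is why the original bug was silent; a static checker is one of the few ways to catch it before the loop misbehaves.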
Upvotes: 1
Reputation: 3631
The error message is actually telling you what's happening. The line of code
z = zr.findAll("tr")
is throwing an AttributeError because a NoneType object does not have a findAll attribute. You are calling findAll on zr, assuming that variable will always be a BeautifulSoup object, but it won't be. If this line:
zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
finds no element in the HTML matching those classes, zr will be set to None. So that is what's happening on one of the pages you are trying to scrape. You can code around it with a try/except statement, like this:
for i in range(len(a)):
    b = a[j]
    URL = "https://de.wikipedia.org/wiki/"+b
    page = requests.get(URL)
    try:
        soup = BeautifulSoup(page.content, 'html.parser')
        l = soup.find(class_="firstHeading")
        zr = soup.find(class_="wikitable infobox infoboxstaat float-right")
        z = zr.findAll("tr")
        a=""
        # don't do this! should be 'for row in z' or some other variable name
        for z in z:
            a=a+z.text
        h=a.find("Hauptstadt")
        lol=a[h:-1]
        lol=lol.replace("Hauptstadt", "")
        lol=lol.strip()
        fg=lol.find("\n")
        lol=lol[0:fg]
        lol=lol.strip()
        j=j+1
        print(lol)
        print(l.text)
    except:
        pass
In this example, any page that doesn't have the right html tags will be skipped.
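An alternative to the broad try/except is an explicit None check before using the result, which skips only the missing-infobox case instead of silencing every error. The same pattern works for any lookup that returns None on no match; here it is sketched with the standard library's re.search, which behaves like soup.find in that respect (the page strings are made-up stand-ins):

```python
import re

pages = ["Hauptstadt Berlin ...", "a page with no infobox"]
capitals = []
for text in pages:
    match = re.search(r"Hauptstadt (\w+)", text)
    if match is None:  # nothing matched: skip this page instead of crashing
        continue
    capitals.append(match.group(1))

print(capitals)  # -> ['Berlin']
```

In the scraper, the equivalent would be checking zr for None right after the soup.find call and continuing to the next country, so an unrelated bug (say, a typo in the parsing code) still surfaces instead of being swallowed by a bare except.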
Upvotes: 1