Reputation: 633
I am practicing web scraping with the book "Web Scraping with Python", but I get the error "AttributeError: 'NoneType' object has no attribute 'get_text'". Can anyone tell me how to solve it? Thanks very much. Here is my code. I am using Python 3, a MySQL database, and macOS.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re
conn = pymysql.connect(host='127.0.0.1',unix_socket='/tmp/mysql.sock',user='root',passwd=None,db='mysql',charset='utf8')
cur = conn.cursor()
cur.execute("USE scraping")
random.seed(datetime.datetime.now())
def store(title, content):
    cur.execute("INSERT INTO pages (title, content) VALUES (\"%s\",\"%s\")", (title, content))
    cur.connection.commit()

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org"+articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    title = bsObj.find("h1").find("span").get_text()
    content = bsObj.find("div", {"id":"mw-content- text"}).find("p").get_text()
    store(title, content)
    return bsObj.find("div", {"id":"bodyContent"}).findAll("a",href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
try:
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
        print(newArticle)
        links = getLinks(newArticle)
finally:
    cur.close()
    conn.close()
Upvotes: 0
Views: 3323
Reputation: 22282
Because there is no span tag inside the h1 tag, bsObj.find("h1").find("span") returns None, and calling get_text() on None raises the AttributeError. If you want the text of the h1, just remove the .find('span'):
title = bsObj.find("h1").get_text()
There is also a stray space in the id string, right after mw-content-:
content = bsObj.find("div", {"id":"mw-content- text"}).find("p").get_text()
                                              ^
Remove it:
content = bsObj.find("div", {"id":"mw-content-text"}).find("p").get_text()
After these two changes your code should work.
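If you want to see why both lookups fail, here is a minimal sketch you can run without touching Wikipedia or MySQL. The SAMPLE_HTML string and the text_or_none helper are my own assumptions for illustration (a simplified stand-in for an article page, not the real markup), but the behaviour of find() returning None when nothing matches is standard BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for an article page (not real Wikipedia markup).
SAMPLE_HTML = """
<html><body>
<h1 id="firstHeading">Kevin Bacon</h1>
<div id="mw-content-text"><p>Kevin Norwood Bacon is an American actor.</p></div>
</body></html>
"""

def text_or_none(tag):
    """Return the tag's text, or None if find() did not match anything."""
    return tag.get_text() if tag is not None else None

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

h1 = soup.find("h1")
missing_span = h1.find("span")  # no <span> inside <h1>, so this is None
title = text_or_none(h1)        # take the text from the <h1> itself

# The id with the stray space matches nothing, so find() returns None.
bad_div = soup.find("div", {"id": "mw-content- text"})

# The corrected id matches; guard each step before calling get_text().
content_div = soup.find("div", {"id": "mw-content-text"})
first_par = content_div.find("p") if content_div is not None else None
content = text_or_none(first_par)

print(missing_span, bad_div)
print(title)
print(content)
```

Checking for None like this is optional once the selectors are fixed, but it keeps the crawler from crashing on a page whose layout differs from what you expect.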
Upvotes: 2