Reputation: 73
import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status() # generate error information
        r.encoding = r.apparent_encoding # could be revised to enhance the speed
        return r.next # return the HTML to other parts of the program
    except:
        return ""

def fillUnivKust(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find('li').children:
        if isinstance(a, bs4.element.Tag): # avoid String type's data
            aaa = a('div') # There are only 2 divs here in this case
            ulist.append([aaa[0].string]) # aaa[0] -> Product's name

def printUnivList(ulist, num):
    for i in range(num):
        u = ulist[i] # u already have
        print(u[i]) # print the ith product's name

def main():
    uinfo = []
    url = 'https://www.cattelanitalia.com/en/products?c=new'
    html = getHTMLText(url)
    fillUnivKust(uinfo, html)
    printUnivList(uinfo, 25)

main()
I tried to write a simple Python web scraper. The code above is the whole program. When I run it, I get this error:
TypeError: object of type 'NoneType' has no len()
I don't know what is wrong.
-- Update --
I changed return r.next to return r.content.
Now it raises this error:
IndexError: list index out of range
Again, I don't know why.
Upvotes: 1
Views: 879
Reputation: 945
Your issue is coming from this line:
soup = BeautifulSoup(html, "html.parser")
The TypeError raised when initialising the BeautifulSoup class tells us that BeautifulSoup tried to perform a len operation but was unable to do so on a NoneType object. Ergo, the data passed in as the html argument (i.e. the first positional argument) was None rather than an HTML doc.
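The message itself is plain Python: calling len() on None produces exactly this text, which is why it points at a missing value rather than at malformed HTML. A minimal sketch that reproduces it without any network access, plus a hypothetical guard called require_html (my own name, not part of the original code or of BeautifulSoup) that would fail earlier with a clearer message:

```python
def require_html(html):
    """Hypothetical guard: reject None or empty input before it reaches the parser."""
    if not html:
        raise ValueError("no HTML was downloaded; check the fetch step")
    return html


# Reproduce the text of the traceback: len() cannot be applied to None.
try:
    len(None)
except TypeError as exc:
    message = str(exc)

print(message)
```

Calling require_html(html) right before BeautifulSoup(html, "html.parser") is one way to make the failure point at the download step instead of the parser.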
So why is html None? That arises from your getHTMLText function, specifically the line:
return r.next
which returns None for the URL you provided in main. The call to requests.get() returns a requests.Response object, and on that object .next returns a "PreparedRequest for the next request in a redirect chain, if there is one." [source] - i.e. not the HTML of the page. You probably want to update that line to:
return r.content
as per this tutorial.
A couple of side notes:
- If an exception is raised in getHTMLText you'll return an empty string, not HTML, so I presume that this will also error when initialising BeautifulSoup.
- Rename the r variable, and in general avoid single-character variable names, as it will make your life easier when you start using the debugger :)
Upvotes: 1
Reputation: 1084
Looks like there is a typo in getHTMLText(): try replacing return r.next with return r.text.
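Once the fetch returns real HTML, the IndexError from the question's update can come from printUnivList. Each row appended by fillUnivKust is a one-element list, so u[i] fails as soon as i reaches 1, and range(num) also runs past the end of the list when fewer than num products were collected. A minimal sketch of a safer version, assuming rows shaped like [[name], ...]; print_product_names and the sample rows are made up for illustration and are not scraped data:

```python
def print_product_names(rows, limit):
    """Print up to `limit` product names from rows shaped like [[name], ...]."""
    printed = []
    # Slicing stops at the end of the list, so the loop never overruns it.
    for row in rows[:limit]:
        name = row[0]  # each row holds a single item: the product name
        print(name)
        printed.append(name)
    return printed


# Made-up rows standing in for what fillUnivKust would append.
sample_rows = [["Chair A"], ["Table B"], ["Lamp C"]]
shown = print_product_names(sample_rows, 25)
```

Asking for 25 names here simply prints the three that exist. If aaa[0] in fillUnivKust can also be missing for some items, that lookup would need a similar check.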
Upvotes: 0