Reputation: 23
I am a beginner struggling through a course, so this problem is probably really simple. I am running this (admittedly messy) code (saved as x.py) to extract a link and a name from a website whose lines look like this:
<li style="margin-top: 21px;">
<a href="http://py4e-data.dr-chuck.net/known_by_Prabhjoit.html">Prabhjoit</a>
</li>
So I set up this:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
for line in soup:
    if not line.startswith('<li'):
        continue
    stuff = line.split('"')
    link = stuff[3]
    thing = stuff[4].split('<')
    name = thing[0].split('>')
    count = count + 1
    if count == 18:
        break
print(name[1])
print(link)
And it keeps producing the error:
Traceback (most recent call last):
  File "x.py", line 15, in <module>
    if not line.startswith('<li'):
TypeError: 'NoneType' object is not callable
I have struggled with this for hours, and I would be grateful for any suggestions.
Upvotes: 0
Views: 2791
Reputation: 1122352
line is not a string, and it has no startswith() method. It is a BeautifulSoup Tag object, because BeautifulSoup has parsed the HTML source text into a rich object model. Don't try to treat it as text!
The error occurs because when you access an attribute that a Tag object doesn't know about, it searches for a child element with that name (so here it executes line.find('startswith')), and since there is no element with that name, None is returned. The call None('<li') then fails with the error you see.
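To make that mechanism concrete, here is a minimal sketch using a toy class. ToyTag is a hypothetical stand-in written only for illustration; it is not BeautifulSoup's actual implementation, it just mimics the "unknown attribute falls back to a child search" behaviour described above.

```python
class ToyTag:
    """Hypothetical stand-in for a parsed element (not BeautifulSoup's real class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        # Return the first direct child with this tag name, or None if there is none.
        for child in self.children:
            if child.name == name:
                return child
        return None

    def __getattr__(self, attr):
        # Python calls this only when normal attribute lookup has failed.
        if attr.startswith('__'):
            raise AttributeError(attr)
        # Treat the unknown attribute as a child tag name.
        return self.find(attr)


li = ToyTag('li', [ToyTag('a')])

print(li.a.name)      # the 'a' child exists, so it is returned
print(li.startswith)  # no child named 'startswith', so this is None

try:
    li.startswith('<li')  # equivalent to None('<li')
except TypeError as exc:
    message = str(exc)
    print(message)
```

The last call raises the same TypeError as in the question: the attribute lookup succeeds (it returns None), and it is the attempt to call that None which fails.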
If you wanted to find the 18th <li> element, just ask BeautifulSoup for that specific element:
soup = BeautifulSoup(html, 'html.parser')
li_link_elements = soup.select('li a[href]', limit=18)
if len(li_link_elements) == 18:
    last = li_link_elements[-1]
    print(last.get_text())
    print(last['href'])
This uses a CSS selector to find only the <a> link elements that are nested inside a <li> element and that have an href attribute. The search is limited to the first 18 such tags, and the last one is printed, but only if we actually found 18 in the page.
The element text is retrieved with the Tag.get_text() method, which will include text from any nested elements (such as <span> or <strong> or other extra markup), and the href attribute is accessed using standard indexing notation.
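If you want to try this without fetching a page, a small self-contained sketch along the same lines might look like the following. The inline HTML, the example.com URLs, and the wanted variable are made up for illustration; it assumes the bs4 package is installed and asks for the 2nd link rather than the 18th to keep the sample short.

```python
from bs4 import BeautifulSoup

# Made-up sample markup in the same shape as the page in the question.
html = """
<ul>
<li style="margin-top: 21px;"><a href="http://example.com/first.html">First</a></li>
<li style="margin-top: 21px;"><a href="http://example.com/second.html"><strong>Sec</strong>ond</a></li>
<li style="margin-top: 21px;"><a href="http://example.com/third.html">Third</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

wanted = 2  # position of the link we are after (18 in the question)
links = soup.select('li a[href]', limit=wanted)

# Only use the result if the page really had that many matching links.
if len(links) == wanted:
    last = links[-1]
    name = last.get_text()  # text of the link, including nested <strong> text
    href = last['href']     # attribute access via indexing
    print(name)
    print(href)
```

Note that the second link wraps part of its text in <strong>, and get_text() still returns the full visible text of the link.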
Upvotes: 1