Reputation: 1239
I was playing around with the BeautifulSoup and Requests APIs today, so I thought I would write a simple scraper that follows links to a depth of 2 (if that makes sense). All the links on the page I am scraping are relative (for example, <a href="/free-man-aman-sethi/books/9788184001341.htm" title="A Free Man">), so to make them absolute I thought I would join the page URL with the relative links using urljoin.
To do this I first had to extract the href value from the <a> tags, and for that I thought I would use split:
#!/bin/python
#crawl.py
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin
html_source=requests.get("http://www.flipkart.com/books")
soup=BeautifulSoup(html_source.content)
links=soup.find_all("a")
temp=links[0].split('"')
This gives the following error:
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    temp=links[0].split('"')
TypeError: 'NoneType' object is not callable
Having dived in before properly going through the documentation, I realize that this is probably not the best way to achieve my objective, but why is there a TypeError?
Upvotes: 3
Views: 8727
Reputation: 199
I just encountered the same error - so for what it's worth four years later: if you need to split up the soup element you can also use str() on it before you split it. In your case that would be:
temp = str(links[0]).split('"')
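To see what splitting the string form of a single tag gives you, here is a small sketch. The tag_text literal is only a hypothetical stand-in for the result of calling str() on one tag (built from the example link in the question), so the exact markup on the real page may differ:

```python
# Hypothetical stand-in for the string form of one <a> tag.
tag_text = '<a href="/free-man-aman-sethi/books/9788184001341.htm" title="A Free Man"></a>'

# Splitting on double quotes alternates between markup and attribute values.
parts = tag_text.split('"')

# The href value lands at index 1 only because href is the first attribute here.
href = parts[1]
print(href)
```

This is fragile: it relies on href being the first attribute and on the values being double-quoted, so looking the attribute up on the tag itself is usually the safer route.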
Upvotes: 1
Reputation: 142146
The Tag class uses proxying for attribute access (as Pavel points out, this is used to reach child elements where possible), so when no matching child element is found, the default of None is returned.
A convoluted example:
>>> print soup.find_all('a')[0].bob
None
>>> print soup.find_all('a')[0].foobar
None
>>> print soup.find_all('a')[0].split
None
You need to use:
soup.find_all('a')[0].get('href')
Where:
>>> print soup.find_all('a')[0].get
<bound method Tag.get of <a href="test"></a>>
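The proxying can be imitated with a toy class. This is only a sketch of the idea, not bs4's actual code: ToyTag and its constructor arguments are made up for illustration, and it only searches direct children. Unknown attribute names fall through to __getattr__, which looks for a child element with that name and returns None when there is none, while real methods such as get are found by normal lookup and never reach __getattr__:

```python
class ToyTag(object):
    """Hypothetical stand-in for a tag; not the real bs4 implementation."""

    def __init__(self, name, attrs=None, children=None):
        self.name = name
        self.attrs = attrs or {}
        self.children = children or []

    def find(self, name):
        # Return the first direct child with the given tag name, or None.
        for child in self.children:
            if child.name == name:
                return child
        return None

    def get(self, key, default=None):
        # Look up an HTML attribute, with a default if it is missing.
        return self.attrs.get(key, default)

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if name.startswith("__"):
            raise AttributeError(name)
        return self.find(name)


link = ToyTag("a", attrs={"href": "/books"}, children=[ToyTag("b")])

print(link.b.name)       # an existing child element is proxied through
print(link.split)        # no child named "split", so this is None
print(link.get("href"))  # a real method, found by normal lookup

try:
    link.split('"')      # same as calling None('"')
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Calling link.split('"') first evaluates link.split, which is None, and then tries to call it. That is where the TypeError in the question comes from.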
Upvotes: 1
Reputation: 62908
links[0] is not a string, it's a bs4.element.Tag. When you try to look up split on it, it does its magic and tries to find a subelement named split, but there is none, so the lookup returns None. You are then calling that None.
In [10]: l = links[0]
In [11]: type(l)
Out[11]: bs4.element.Tag
In [17]: print l.split
None
In [18]: None() # :)
TypeError: 'NoneType' object is not callable
Use indexing to look up HTML attributes:
In [21]: links[0]['href']
Out[21]: '/?ref=1591d2c3-5613-4592-a245-ca34cbd29008&_pop=brdcrumb'
Or use get if the attribute might not exist:
In [24]: links[0].get('href')
Out[24]: '/?ref=1591d2c3-5613-4592-a245-ca34cbd29008&_pop=brdcrumb'
In [26]: print links[0].get('wharrgarbl')
None
In [27]: print links[0]['wharrgarbl']
KeyError: 'wharrgarbl'
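Once you have the href values, making them absolute with urljoin (the original goal in the question) could look roughly like this. It is a sketch under a few assumptions: absolute_links is a hypothetical helper, and the hrefs list stands in for values collected with get('href'), including a None for a tag without an href:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in the question


def absolute_links(base_url, hrefs):
    """Join each href onto base_url, skipping missing or empty values."""
    return [urljoin(base_url, href) for href in hrefs if href]


base = "http://www.flipkart.com/books"
# Stand-in for [link.get('href') for link in links]; None models a missing href.
hrefs = ["/free-man-aman-sethi/books/9788184001341.htm", None]

result = absolute_links(base, hrefs)
print(result)
```

Using get rather than indexing here means tags without an href yield None instead of raising KeyError, and the helper simply drops them.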
Upvotes: 6