Reputation: 241
I am getting the error
'NoneType' object has no attribute 'encode'
when I run this code:
url = soup.find('div',attrs={"class":"entry-content"}).findAll('div', attrs={"class":None})
fobj = open('D:\Scraping\parveen_urls.txt', 'w')
for getting in url:
    fobj.write(getting.string.encode('utf8'))
But when I use find instead of findAll, I get one URL. How do I get all of the URLs from the object with findAll?
Upvotes: 2
Views: 7189
Reputation: 1
I found the issue was caused by null data.
I fixed it by filtering out the null data.
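For example, a minimal sketch of that filtering, assuming bs4 and the same selectors as the question (the markup, file name, and Python 3 text-mode write here are illustrative, not from the question):

from bs4 import BeautifulSoup

# Illustrative markup: the second inner div has more than one child,
# so its .string is None (the "null data") and .encode() would fail on it.
html = ('<div class="entry-content">'
        '<div>http://example.com/a</div>'
        '<div>http://example.com/b <span>extra</span></div>'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')

divs = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})

with open('parveen_urls.txt', 'w', encoding='utf8') as fobj:
    for div in divs:
        if div.string is not None:   # skip the null data
            fobj.write(div.string + '\n')

Only the first inner div is written; the one with nested markup is skipped.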
Upvotes: 0
Reputation: 474131
'NoneType' object has no attribute 'encode'
You are using .string. If a tag has more than one child, .string is None. From the docs:
If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:
Use .get_text() instead.
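For instance, a minimal sketch of the difference, using illustrative markup rather than the asker's page:

from bs4 import BeautifulSoup

# A tag with more than one child: .string is None, but .get_text() still works.
soup = BeautifulSoup('<div>http://example.com <span>(mirror)</span></div>', 'html.parser')
div = soup.div

print(div.string)      # None; calling .encode() on this raises the error above
print(div.get_text())  # http://example.com (mirror)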
Upvotes: 3
Reputation: 11
Below I provide two examples and one possible solution:
from bs4 import BeautifulSoup

# Example 1: the document contains the expected "entry-content" div.
doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry-content"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc))
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
fobj = open('.\parveen_urls.txt', 'w')
for getting in url:
    fobj.write(getting.string.encode('utf8'))
# Example 2: the document does NOT contain an "entry-content" div.
doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc))
"""
The error will be raised here because find() returns nothing (None),
and calling "findAll" on a None object raises:
AttributeError: 'NoneType' object has no attribute 'findAll'
"""
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
fobj = open('.\parveen_urls2.txt', 'w')
for getting in url:
    fobj.write(getting.string.encode('utf8'))
# Possible solution: check the result of find() before calling findAll().
doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc))
url = soup.find('div', attrs={"class": "entry-content"})
"""
Deal with documents that do not have the expected HTML structure.
"""
if url:
    url = url.findAll('div', attrs={"class": None})
    fobj = open('.\parveen_urls2.txt', 'w')
    for getting in url:
        fobj.write(getting.string.encode('utf8'))
else:
    print("The html source does not comply with the expected structure")
Upvotes: 1