M Talha Afzal

Reputation: 241

NoneType object has no attribute 'encode' (Web Scraping)

I am getting the error

'NoneType' object has no attribute 'encode'

when I run this code:

url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})

fobj = open('D:\Scraping\parveen_urls.txt', 'w')

for getting in url:
    fobj.write(getting.string.encode('utf8'))

When I use find instead of findAll, I get a single URL. How can I get all the URLs from the result of findAll?

Upvotes: 2

Views: 7189

Answers (3)

Thanh Nguyen

Reputation: 1

I found that the issue was caused by null (None) data.

I fixed it by filtering out the null values before encoding.
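
A minimal sketch of that filtering, assuming the markup below (it is made up for illustration) and assuming that skipping the affected tags is acceptable. The middle div has more than one child, so its .string is None and it is dropped before encode() is called:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second inner div has two children,
# so its .string is None.
html = (
    '<div class="entry-content">'
    '<div>http://teste.com</div>'
    '<div><b>bold</b> and text</div>'
    '<div>http://teste2.com</div>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', attrs={'class': 'entry-content'})
divs = container.find_all('div')

# Keep only the tags whose .string is not None, then encode.
encoded = [d.string.encode('utf8') for d in divs if d.string is not None]
print(encoded)
```

Note that this silently discards the text of any tag with mixed content, which may or may not be what you want.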

Upvotes: 0

alecxe

Reputation: 474131

'NoneType' object has no attribute 'encode'

You are using .string. If a tag has multiple children, .string is None. It is only passed through when there is a single child (docs):

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

Use .get_text() instead.
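
A small sketch of the difference, using made-up markup. The first div has three children (text, a link, more text), so .string is None, while get_text() joins all the text. The second div has a single child tag, so .string is passed through:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with mixed content inside the div.
html = '<div>See <a href="http://teste.com">http://teste.com</a> for details</div>'
tag = BeautifulSoup(html, 'html.parser').div

print(tag.string)      # None, because the div has several children
print(tag.get_text())  # all text inside the div, concatenated

# A div whose only child is another tag inherits that child's .string.
only_child = BeautifulSoup('<div><b>http://teste2.com</b></div>', 'html.parser').div
print(only_child.string)
```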

Upvotes: 3

jcfausto

Reputation: 11

Below I provide two examples and one possible solution:

  • Example 1 shows a working sample.
  • Example 2 shows a non-working sample that raises a similar 'NoneType' AttributeError.
  • The solution shows one way to guard against it.

Example 1: The HTML has the expected div

from bs4 import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
    '<body><div class="entry-content"><div>http://teste.com</div>',
    '<div>http://teste2.com</div></div></body>',
    '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
# Binary mode, because encode() returns bytes
fobj = open(r'.\parveen_urls.txt', 'wb')
for getting in url:
    fobj.write(getting.string.encode('utf8'))
fobj.close()

Example 2: The HTML does not contain the expected div

from bs4 import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
    '<body><div class="entry"><div>http://teste.com</div>',
    '<div>http://teste2.com</div></div></body>',
    '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')

# The error is raised on the next line: the first find() matches nothing
# and returns None, and calling findAll on None raises
# AttributeError: 'NoneType' object has no attribute 'findAll'
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
# Binary mode, because encode() returns bytes
fobj = open(r'.\parveen_urls2.txt', 'wb')
for getting in url:
    fobj.write(getting.string.encode('utf8'))
fobj.close()

Possible solution:

from bs4 import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
    '<body><div class="entry"><div>http://teste.com</div>',
    '<div>http://teste2.com</div></div></body>',
    '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
url = soup.find('div', attrs={"class": "entry-content"})

# Handle documents that do not have the expected HTML structure
if url is not None:
    url = url.findAll('div', attrs={"class": None})
    # Binary mode, because encode() returns bytes
    fobj = open(r'.\parveen_urls2.txt', 'wb')
    for getting in url:
        fobj.write(getting.string.encode('utf8'))
    fobj.close()
else:
    print("The HTML source does not comply with the expected structure")
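
The same guard can be wrapped in a small helper. This is only a sketch: the function name extract_urls is hypothetical, and it assumes the inner divs are not nested and that an empty list is an acceptable result when the container is missing. It uses get_text() so that a div with several children does not produce None:

```python
from bs4 import BeautifulSoup


def extract_urls(html):
    """Return the text of each div inside div.entry-content, or [] if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find('div', attrs={'class': 'entry-content'})
    if container is None:
        # The page does not have the expected structure.
        return []
    # get_text() never returns None, unlike .string.
    return [div.get_text(strip=True) for div in container.find_all('div')]


good = ('<body><div class="entry-content"><div>http://teste.com</div>'
        '<div>http://teste2.com</div></div></body>')
bad = '<body><div class="entry"><div>http://teste.com</div></div></body>'

print(extract_urls(good))
print(extract_urls(bad))
```

Returning a list keeps the parsing separate from the file writing, so the caller decides how to store the result.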

Upvotes: 1
