Kaitlin

Reputation: 59

Python Error: 'NoneType' object has no attribute 'find_all' using Beautiful Soup

I'm having a problem with some web-scraping code. I want to scrape information from a series of links like the following:

http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument

I am trying to scrape certain elements from the table, but I received the following error:

Python Error: 'NoneType' object has no attribute 'find_all'

I know this is because the table is not actually being found, as the following simplified code shows:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')


table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)

It prints None for the table, so the code cannot scrape anything from it. I've been running similar code on similar pages and it finds the table just fine, so I'm not sure why this one doesn't work. I'm new to web scraping, but I'd appreciate any help!

Upvotes: 1

Views: 688

Answers (3)


import pandas as pd

df = pd.read_html(
    "http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument")[0]

print(df)
df.to_csv("Data.csv", index=False, header=None)

Upvotes: 0

kareem_emad

Reputation: 1183

I think the HTML contains some flaws that make html.parser fail to parse it properly. You can verify this by printing page.text and then printing soup: you will find that parts of the document have been dropped by the parser.

However, the lxml parser parses the page successfully despite the flaw, since lxml is more tolerant of malformed HTML documents (the lxml package has to be installed for this to work):

from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')


table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)

That should find the table tag correctly.
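
If you want to check whether the table is really in the response before blaming the parser, a rough sketch like the one below might help. It only looks at the raw text with a regular expression, so it does not depend on any HTML parser. The function name raw_html_mentions_table and the sample strings are made up for illustration; in the real script you would pass page.text instead.

```python
import re


def raw_html_mentions_table(raw_html: str, bordercolor: str) -> bool:
    """Return True if the raw text contains a <table> tag with this bordercolor.

    This is a plain text search, not a parse, so it still works when the
    surrounding markup is malformed.
    """
    pattern = re.compile(
        r"<table\b[^>]*\bbordercolor\s*=\s*[\"']?" + re.escape(bordercolor),
        re.IGNORECASE,
    )
    return pattern.search(raw_html) is not None


# Hypothetical sample standing in for page.text
sample = '<html><body><table border="1" bordercolor="#6583A0"><tr><td>x</td></tr></table></body></html>'
print(raw_html_mentions_table(sample, "#6583A0"))
```

If this reports the table as present in page.text while soup.find(...) still returns None, the response is fine and the parser is the part to change.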

Upvotes: 1

Frank

Reputation: 1285

The soup doesn't parse the page content correctly because one closing tag is malformed and breaks the structure: a </script tag is missing its closing >. You have to fix it before parsing:

from bs4 import BeautifulSoup
import requests

url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'

page = requests.get(url)
soup = BeautifulSoup(page.text.replace("</script\n", "</script>"), 'html.parser')

table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
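
The replace call above only matches a </script that is followed directly by a newline. If other pages in the series break the tag slightly differently (trailing spaces, different casing), a regular expression may be a bit more forgiving. This is only a sketch under that assumption; the helper name repair_script_close_tags and the sample string are hypothetical, and I have not checked which variants actually occur on that site.

```python
import re

# "</script" as a whole word that is NOT followed by optional whitespace and ">"
_BROKEN_SCRIPT_CLOSE = re.compile(r"</script\b(?!\s*>)", re.IGNORECASE)


def repair_script_close_tags(html: str) -> str:
    """Append the missing '>' to any unterminated </script closing tag."""
    return _BROKEN_SCRIPT_CLOSE.sub("</script>", html)


# Hypothetical sample standing in for page.text
broken = '<script>var x = 1;</script\n<table bordercolor="#6583A0"></table>'
fixed = repair_script_close_tags(broken)
print(fixed)
```

The repaired string can then be passed to BeautifulSoup(..., 'html.parser') in place of the page.text.replace(...) expression. Well-formed </script> tags are left alone, so running the helper on an already repaired page changes nothing.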

Upvotes: 1
