Web scraping a table with a hidden part using Python

Question

I'm trying to get the information from this table:

No Si 
100%(15) 0%(0) Más información

I'm doing the following in Python 3:

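from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: headers was not shown in the post
req = Request('http://www.congresovisible.org/votaciones/10918/', headers=headers)
web_page = urlopen(req)
soup = BeautifulSoup(web_page.read(), 'html.parser')
table = soup.find_all('table', attrs={'class': 'table4 table4-1 table4-1-1'})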

This works, but it only returns part of the table. Everything after the following is missing:

NoSi100%(15)]

How could I extract the whole table?

alecxe · Accepted Answer

This is actually quite easy to solve. html.parser does not handle this kind of malformed HTML well. Use the more lenient html5lib parser instead. This works for me:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.congresovisible.org/votaciones/10918/')
soup = BeautifulSoup(response.content, 'html5lib')
table = soup.find_all('table', attrs={'class':'table4 table4-1 table4-1-1'})
print(table)

Note that this requires the html5lib package to be installed:

pip install --upgrade html5lib

By the way, the lxml parser works as well (it is also a separate package that has to be installed):

soup = BeautifulSoup(response.content, 'lxml')
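
To see how much the choice of parser matters, it can help to run the same broken markup through each parser and compare what comes back. The following is only a rough sketch: SAMPLE_HTML is a made-up snippet (not copied from the real page) and cells_per_parser is a hypothetical helper name. Parsers that are not installed are skipped, so it runs with html.parser alone.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical malformed snippet (NOT taken from the real page):
# the <td> and <tr> tags are never closed and a stray </div> sits mid-table.
SAMPLE_HTML = """
<table class="table4 table4-1 table4-1-1">
  <tr><td>No<td>Si
  </div>
  <tr><td>100%(15)<td>0%(0)
</table>
"""


def cells_per_parser(html, parsers=("html.parser", "html5lib", "lxml")):
    """Return {parser_name: [text of every <td> in the first table]}."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
        table = soup.find("table")
        cells = table.find_all("td") if table else []
        results[name] = [cell.get_text(strip=True) for cell in cells]
    return results


if __name__ == "__main__":
    for parser, cells in cells_per_parser(SAMPLE_HTML).items():
        print(parser, cells)
```

If the lists differ between parsers, the markup is being repaired differently by each one, which is the kind of situation where switching to html5lib or lxml is worth trying.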
