Extra HTML tag causing problems with bs4

Question

I am trying to grab some information from a table on the site http://www.house.gov/representatives/. Specifically, I want to get information on representatives from the "Representative Directory By Last Name" tables. So far, I am able to download the HTML from the site and write it to a file, but when I use bs4 to parse the file and grab the specific tables I want, it only grabs the first row of each table.

This is because there is an extra tag in each row of the HTML table:

Abraham, Ralph
Louisiana 5th District
R
417 CHOB
202-225-8490
Agriculture
Armed Services
Science, Space, and Technology

That last /td tag is somehow causing bs4 to not grab the rest of the rows. I tested this by manually deleting some of the extra tags, and I got back all the rows, so I know that extra tag is the problem. Here is my Python code so far:

import bs4
import requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()

# Save the page to disk; the with-block closes the file before it is reopened.
with open('HouseReps.html', 'wb') as file:
    for chunk in res.iter_content(100000):
        file.write(chunk)

with open('HouseReps.html') as file:
    soup = bs4.BeautifulSoup(file, 'html.parser')

table = soup.select('table[title="Representative Directory By Last Name"]')
print(table)

I've also tried using prettify(), but that did not help either. Any ideas on how to clean up the HTML so I can use bs4 (or something else) to parse and extract the tables I need?

Thanks!

Tiny.D · Accepted Answer

You could use the lxml parser instead of html.parser in your code (lxml is a separate package, so it has to be installed first, e.g. with pip install lxml):

import bs4
import requests

res = requests.get('http://www.house.gov/representatives/')
res.raise_for_status()

# Save the page to disk; the with-block closes the file before it is reopened.
with open('HouseReps.html', 'wb') as file:
    for chunk in res.iter_content(100000):
        file.write(chunk)

with open('HouseReps.html') as file:
    soup = bs4.BeautifulSoup(file, 'lxml')  # use the `lxml` parser instead of `html.parser`

table = soup.find_all('table', {'title': 'Representative Directory By Last Name'})
print(table[0])  # print the first table

The output will show you the full first table with "title" = "Representative Directory By Last Name":

Name             | District                     | Party | Room      | Phone        | Committee Assignment
-----------------|------------------------------|-------|-----------|--------------|---------------------
Abraham, Ralph   | Louisiana 5th District       | R     | 417 CHOB  | 202-225-8490 | Agriculture; Armed Services; Science, Space, and Technology
Adams, Alma      | North Carolina 12th District | D     | 222 CHOB  | 202-225-1510 | Agriculture; Education and the Workforce; Small Business
Aderholt, Robert | Alabama 4th District         | R     | 235 CHOB  | 202-225-4876 | Appropriations
Aguilar, Pete    | California 31st District     | D     | 1223 LHOB | 202-225-3201 | Appropriations
Allen, Rick      | Georgia 12th District        | R     | 426 CHOB  | 202-225-2823 | Agriculture; Education and the Workforce
Amash, Justin    | Michigan 3rd District        | R     | 114 CHOB  | 202-225-3831 | Oversight and Government
Amodei, Mark     | Nevada 2nd District          | R     | 332 CHOB  | 202-225-6155 | Appropriations
Arrington, Jodey | Texas 19th District          | R     | 1029 LHOB | 202-225-4005 | Agriculture; the Budget; Veterans' Affairs
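
If installing lxml is not an option, another route is to skip tree building altogether and pull the cell text out with the standard library's html.parser.HTMLParser, ignoring any closing tag that has nothing to close. The following is only a minimal sketch and not part of the original answer: the class name TableRowExtractor, the helper extract_rows, and the SAMPLE markup (a two-column table with an extra </td> at the end of each data row) are made up for illustration, and the real page's markup may differ. It collects rows from every table in the input and does not handle nested tables.

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, one list per <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the open cell, or None outside a cell

    def _close_cell(self):
        # Store the open cell (if any) with its whitespace normalised.
        if self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        # Store the open row (if any), flushing an unclosed cell first.
        if self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            # A stray </td> arrives when no cell is open, so nothing happens.
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Illustrative markup only: each data row ends with an extra </td>.
SAMPLE = """
<table title="Representative Directory By Last Name">
  <tr><th>Name</th><th>District</th></tr>
  <tr><td>Abraham, Ralph</td><td>Louisiana 5th District</td></td></tr>
  <tr><td>Adams, Alma</td><td>North Carolina 12th District</td></td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

Because the extractor never builds a tree, an unmatched </td> cannot close anything further up the document; the trade-off is that you lose bs4's selectors, so filtering by the table's title attribute would have to be added by hand in handle_starttag.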
