Hello, I am quite new to parsing HTML tables with Python and BeautifulSoup4. All had been going well until I ran into this odd table, which uses a 'th' tag midway through the table, causing my parse to quit and throw an 'index out of range' error. I've tried searching SO and Google to no avail. The question is: how would I ignore or strip this rogue 'th' tag while parsing the table?
Here is the code I have so far:
from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()
url = 'https://www.moscone.com/site/do/event/list'
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find('table', { 'id' : 'list' })
for row in table.findAll('tr')[3:]:
    col = row.findAll('td')
    date = col[0].string
    name = col[1].string
    location = col[2].string
    record = (name, date, location)
    final = ','.join(record)
    print(final)
Here is a small snippet of the HTML that causes my error:
<td>
Convention
</td>
</tr>
<tr>
<th class="title" colspan="4">
Mon Dec 01 00:00:00 PST 2014
</th>
</tr>
<tr>
<td>
12/06/14 - 12/09/14
</td>
I do want the data above and below this rogue 'th', which marks the start of a new month in the table.
Upvotes: 1
Views: 868
Reputation: 20553
You can just check whether a th tag is in the row, and parse the row's content only if it is not, like this:
for row in table.findAll('tr')[3:]:
    # only parse rows that do not contain a th cell
    if not row.find_all('th'):
        col = row.findAll('td')
        date = col[0].string
        name = col[1].string
        location = col[2].string
        record = (name, date, location)
        final = ','.join(record)
        print(final)
These are the results I get from the URL you provided, without an IndexError:
Out & Equal Workplace,11/03/14 - 11/06/14,Moscone West
Samsung Developer Conference,11/11/14 - 11/13/14,Moscone West
North American Spine Society (NASS) Annual Meeting,11/12/14 - 11/15/14,Moscone South and Esplanade Ballroom
San Francisco International Auto Show,11/22/14 - 11/29/14,Moscone North & South
67th Annual Meeting of the APS Division of Fluid Dynamics,11/23/14 - 11/25/14,Moscone North, South and West
American Society of Hematology,12/06/14 - 12/09/14,Moscone North, South and West
California School Boards Association,12/12/14 - 12/16/14,Moscone North & Esplanade Ballroom
American Geophysical Union,12/15/14 - 12/19/14,Moscone North & South
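If you want to try the idea without hitting the live site, here is a minimal, self-contained sketch. It is a variation on the check above, not the same code: it skips any row that has fewer than three td cells instead of looking for th directly, and it reads cell text with get_text(strip=True), since .string can return None when a cell holds more than one child node. The SAMPLE_HTML markup and the extract_events helper are made up for illustration, modelled on the snippet in the question; the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the snippet in the question.
SAMPLE_HTML = """
<table id="list">
  <tr>
    <td>11/11/14 - 11/13/14</td>
    <td>Samsung Developer Conference</td>
    <td>Moscone West</td>
    <td>Convention</td>
  </tr>
  <tr>
    <th class="title" colspan="4">Mon Dec 01 00:00:00 PST 2014</th>
  </tr>
  <tr>
    <td>12/06/14 - 12/09/14</td>
    <td>American Society of Hematology</td>
    <td>Moscone North, South and West</td>
    <td>Convention</td>
  </tr>
</table>
"""


def extract_events(html):
    """Return (name, date, location) tuples, skipping rows with too few td cells."""
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "list"})
    events = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            # Month header rows hold only a th cell, so there is nothing to index.
            continue
        date, name, location = (cell.get_text(strip=True) for cell in cells[:3])
        events.append((name, date, location))
    return events


for event in extract_events(SAMPLE_HTML):
    print(",".join(event))
```

Note that some locations contain commas themselves (for example "Moscone North, South and West"), so if the output is meant to be read back as CSV, writing the tuples with Python's csv module would be safer than a plain ','.join.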
Upvotes: 1