user2121361


How to ignore a th tag while parsing an HTML table?

Hello, I am quite new to parsing HTML tables with Python and beautifulsoup4. All had been going well until I ran into this weird table, which uses a 'th' tag midway through the table, causing my parse to quit and throw an 'index is out of range' error. I've tried searching SO and Google to no avail. The question is: how do I ignore or strip this rogue 'th' tag while parsing the table?

Here is the code I have so far:

from mechanize import Browser
from bs4 import BeautifulSoup

mech = Browser()
url = 'https://www.moscone.com/site/do/event/list'
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
table = soup.find('table', { 'id' : 'list' })

for row in table.findAll('tr')[3:]:
    col = row.findAll('td')
    date = col[0].string
    name = col[1].string
    location = col[2].string
    record = (name, date, location)
    final = ','.join(record)
    print(final)

Here is a small snippet of the HTML that causes my error:

  <td>
   Convention
  </td>
 </tr>
 <tr>
  <th class="title" colspan="4">
   Mon Dec 01 00:00:00 PST 2014
  </th>
 </tr>
 <tr>
  <td>
   12/06/14 - 12/09/14
  </td>

I do want the data above and below this rogue 'th', which marks the start of a new month in the table.

Upvotes: 1

Views: 868

Answers (1)

Anzel

Reputation: 20553

You can check whether a row contains a th tag and parse the row only if it does not, like this:

for row in table.find_all('tr')[3:]:
    # Skip the month-header rows: they hold a th and no td cells,
    # so indexing col[0] on them raises the IndexError
    if not row.find_all('th'):
        col = row.find_all('td')
        date = col[0].string
        name = col[1].string
        location = col[2].string
        record = (name, date, location)
        final = ','.join(record)
        print(final)

These are the results I get from the URL you provided, without the IndexError:

Out & Equal Workplace,11/03/14 - 11/06/14,Moscone West 
Samsung Developer Conference,11/11/14 - 11/13/14,Moscone West  
North American Spine Society (NASS) Annual Meeting,11/12/14 - 11/15/14,Moscone South and Esplanade Ballroom 
San Francisco International Auto Show,11/22/14 - 11/29/14,Moscone North & South 
67th Annual Meeting of the APS Division of Fluid Dynamics,11/23/14 - 11/25/14,Moscone North, South and West 
American Society of Hematology,12/06/14 - 12/09/14,Moscone North, South and West 
California School Boards Association,12/12/14 - 12/16/14,Moscone North & Esplanade Ballroom 
American Geophysical Union,12/15/14 - 12/19/14,Moscone North & South
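
If you want to try the idea without hitting the live site, here is a minimal self-contained sketch. Note the assumptions: SAMPLE_HTML is a hand-built sample that only mirrors the structure of the snippet in the question (it is not the real page), the helper name parse_rows is hypothetical, and instead of testing for th it skips any row with fewer than three td cells, which is a variation on the check above rather than the same code. It also uses get_text(strip=True) in place of .string to trim the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hand-built sample that mirrors the table structure from the question;
# this is an assumption for illustration, not the real page.
SAMPLE_HTML = """
<table id="list">
 <tr>
  <td>11/23/14 - 11/25/14</td>
  <td>67th Annual Meeting of the APS Division of Fluid Dynamics</td>
  <td>Moscone North, South and West</td>
  <td>Convention</td>
 </tr>
 <tr>
  <th class="title" colspan="4">Mon Dec 01 00:00:00 PST 2014</th>
 </tr>
 <tr>
  <td>12/06/14 - 12/09/14</td>
  <td>American Society of Hematology</td>
  <td>Moscone North, South and West</td>
  <td>Convention</td>
 </tr>
</table>
"""


def parse_rows(html):
    """Return (name, date, location) tuples, skipping rows without data cells."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'id': 'list'})
    records = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        # A month-header row holds only a th, so it has no td cells
        if len(cols) < 3:
            continue
        date, name, location = (c.get_text(strip=True) for c in cols[:3])
        records.append((name, date, location))
    return records


for record in parse_rows(SAMPLE_HTML):
    print(','.join(record))
```

Checking the number of td cells also protects against other short rows, not just the th ones. One thing to watch: some locations contain commas themselves (for example "Moscone North, South and West"), so if the output is meant to be read back as CSV, the standard csv module would be a safer choice than ','.join.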

Upvotes: 1
