Reputation: 111
I'm having a problem scraping a table from an HTML page. Actually it is 3 tables inside a bigger table. I'm using BS4 and it works fine up to the point of finding all the 'td' tags, but when I try to print the info that I need, the program stops at the end of the first table and shows this error message:
"IndexError: list index out of range"
import re
import urllib2
from bs4 import BeautifulSoup
url = 'http://trackinfo.com/entries-alphabetical.jsp?raceid13=GBR$20140314A'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    print tds[0].text, tds[1].text
Any ideas how to fix it?
Upvotes: 1
Views: 448
Reputation: 473763
The idea is to iterate over the tables inside the top-level table, then, for each table, iterate over its rows, skipping the first row, which contains the column titles:
import urllib2
from bs4 import BeautifulSoup
url = 'http://trackinfo.com/entries-alphabetical.jsp?raceid13=GBR$20140314A'
soup = BeautifulSoup(urllib2.urlopen(url))
for index, table in enumerate(soup.find('table').find_all('table')):
    print "Table #%d" % index

    for tr in table.find_all('tr')[1:]:
        tds = tr.find_all('td')
        print "Runner: %s, Race: %s" % (tds[0].text.strip(), tds[1].text.strip())
prints:
Table #0
Runner: ALL SHOOK UP, Race: 11
Runner: ARLINGTON ADIE, Race: 9
Runner: BARTS BIKERCHICK, Race: 10
Runner: BARTS GAME DAY, Race: 4
Runner: BARTS SIR PRIZE, Race: 7
Runner: BJ'S PIZAZZ, Race: 7
Runner: BOC'S BAMA BOY, Race: 14
Runner: BOC'S BRADBERRY, Race: 2
Runner: BOC'S CRIMSNTIDE, Race: 9
...
Also, note that you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor - it will call read() under the hood.
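The constructor accepts any file-like object, not just the response returned by urlopen. A minimal, self-contained sketch (using io.BytesIO in place of the network response, with made-up markup):

```python
from io import BytesIO

from bs4 import BeautifulSoup

# BytesIO stands in for the response object returned by urlopen;
# BeautifulSoup calls .read() on it under the hood.
fake_response = BytesIO(b"<table><tr><td>ALL SHOOK UP</td><td>11</td></tr></table>")
soup = BeautifulSoup(fake_response, "html.parser")

tds = soup.find("tr").find_all("td")
print(tds[0].text, tds[1].text)
```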
Hope that helps.
Upvotes: 1
Reputation: 2061
Looking at your code, the loop assumes that every tr element found will always contain at least 2 td elements. If any tr element contains fewer than 2, an IndexError is raised.
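To see the failure mode in isolation, here is a minimal sketch using plain Python lists to stand in for the rows of td cells that find_all returns (the row contents are invented for illustration):

```python
# Plain lists standing in for the td cells found in each tr.
rows = [
    ["RUNNER A", "3"],   # a normal data row: two cells
    ["spacer row"],      # a layout row: only one cell
]

results = []
for tds in rows:
    try:
        # tds[1] does not exist on the one-cell row, so this raises.
        results.append((tds[0], tds[1]))
    except IndexError:
        results.append("IndexError on row with %d cell(s)" % len(tds))

print(results)
```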
Try changing the loop to something like this:
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    if len(tds) >= 2:
        print tds[0].text, tds[1].text
The check that requires at least 2 td elements is specific to the page you are parsing, and assumes you want the two values printed together. A more general solution could be:
for tr in soup.find_all('tr')[2:]:
    for td in tr.find_all('td'):
        print td.text
Upvotes: 1