Reputation: 111
I'm having a problem scraping a table from an HTML page. Actually it is 3 tables inside a bigger table. I'm using BS4 and it works fine up to the point of finding all the 'td' tags, but when I try to print the info that I need, the program stops at the end of the first table and shows this error message:
"IndexError: list index out of range"
import re
import urllib2
from bs4 import BeautifulSoup
url = 'http://trackinfo.com/entries-alphabetical.jsp?raceid13=GBR$20140314A'
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    print tds[0].text, tds[1].text
Any ideas how to fix it?
Upvotes: 1
Views: 448
Reputation: 473763
The idea is to iterate over the tables inside the top-level table, then, for each table, iterate over its rows, skipping the first row, which contains the column titles:
import urllib2
from bs4 import BeautifulSoup
url = 'http://trackinfo.com/entries-alphabetical.jsp?raceid13=GBR$20140314A'
soup = BeautifulSoup(urllib2.urlopen(url))
for index, table in enumerate(soup.find('table').find_all('table')):
    print "Table #%d" % index

    for tr in table.find_all('tr')[1:]:
        tds = tr.find_all('td')
        print "Runner: %s, Race: %s" % (tds[0].text.strip(), tds[1].text.strip())
prints:
Table #0
Runner: ALL SHOOK UP, Race: 11
Runner: ARLINGTON ADIE, Race: 9
Runner: BARTS BIKERCHICK, Race: 10
Runner: BARTS GAME DAY, Race: 4
Runner: BARTS SIR PRIZE, Race: 7
Runner: BJ'S PIZAZZ, Race: 7
Runner: BOC'S BAMA BOY, Race: 14
Runner: BOC'S BRADBERRY, Race: 2
Runner: BOC'S CRIMSNTIDE, Race: 9
...
Also, note that you can pass urllib2.urlopen(url) directly to the BeautifulSoup constructor - it will call read() under the hood.
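The constructor accepts any file-like object, not just the response returned by urlopen. A minimal, self-contained sketch (using io.BytesIO in place of the network response, with made-up markup):

```python
from io import BytesIO

from bs4 import BeautifulSoup

# BytesIO stands in for the response object returned by urlopen;
# BeautifulSoup calls .read() on it under the hood.
fake_response = BytesIO(b"<table><tr><td>ALL SHOOK UP</td><td>11</td></tr></table>")
soup = BeautifulSoup(fake_response, "html.parser")

tds = soup.find("tr").find_all("td")
print(tds[0].text, tds[1].text)
```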
Hope that helps.
Upvotes: 1
Reputation: 2061
Looking at your code, the loop assumes that every tr element found will always contain at least 2 td elements. If any tr element contains fewer than 2, an IndexError is raised.
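To see the failure mode in isolation, here is a minimal sketch using plain Python lists to stand in for the rows of td cells that find_all returns (the row contents are invented for illustration):

```python
# Plain lists standing in for the td cells found in each tr.
rows = [
    ["RUNNER A", "3"],   # a normal data row: two cells
    ["spacer row"],      # a layout row: only one cell
]

results = []
for tds in rows:
    try:
        # tds[1] does not exist on the one-cell row, so this raises.
        results.append((tds[0], tds[1]))
    except IndexError:
        results.append("IndexError on row with %d cell(s)" % len(tds))

print(results)
```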
Try changing the loop to something like this:
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    if len(tds) >= 2:
        print tds[0].text, tds[1].text
The check that requires at least 2 td elements is specific to the page you are parsing, and assumes you want the two values printed together. A more general solution could be:
for tr in soup.find_all('tr')[2:]:
    for td in tr.find_all('td'):
        print td.text
Upvotes: 1