Reputation: 235
I've been trying different methods of scraping data from this site (http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=WR&college=) and can't get any of them to work. I've tried playing with the indices, but can't make it work. I think I've tried too many things at this point, so if someone could point me in the right direction I would really appreciate it.
I would like to pull all of the information and export it to a .csv file, but at this point I'm just trying to get the name and position to print to get started.
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import re
url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
for row in table.findAll('tr')[0:]:
    col = row.findAll('tr')
    name = col[1].string
    position = col[3].string
    player = (name, position)
    print "|".join(player)
Here's the error I'm getting, at line 14 (name = col[1].string):
IndexError: list index out of range
--UPDATE--
Ok, I've made a little progress. It now allows me to go from start to finish, but it requires knowing how many rows are in the table. How would I get it to just go through them until the end? Updated Code:
import urllib2
from bs4 import BeautifulSoup
import re
url = ('http://nflcombineresults.com/nflcombinedata.php?year=1999&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
for row in table.findAll('tr')[1:250]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    player = (name, position)
    print "|".join(player)
Upvotes: 5
Views: 11939
Reputation: 446
A simple way to parse the table column-wise:
import requests
from bs4 import BeautifulSoup

def table_to_list(table):
    data = []
    all_th = table.find_all('th')
    all_heads = [th.get_text() for th in all_th]
    for tr in table.find_all('tr'):
        all_th = tr.find_all('th')
        if all_th:
            continue  # skip header rows
        all_td = tr.find_all('td')
        data.append([td.get_text() for td in all_td])
    return list(zip(all_heads, *data))  # one tuple per column: (header, value, value, ...)

# url is the page from the question; headers is whatever request headers you want to send
r = requests.get(url, headers=headers)
bs = BeautifulSoup(r.text)
all_tables = bs.find_all('table')
table_to_list(all_tables[0])
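As a usage note (a sketch, assuming the call above succeeded): each tuple in the returned list starts with a column header followed by that column's values, so the result can be regrouped into a dict of columns:
columns = table_to_list(all_tables[0])
by_header = {col[0]: list(col[1:]) for col in columns}  # header -> list of values
print(by_header.get('Name', [])[:5])  # first few names, if a 'Name' header exists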
Upvotes: 0
Reputation: 235
I figured it out after only 8 hours or so. Learning is fun. Thanks for the help, Kevin! The script now includes the code to output the scraped data to a .csv file. Next up is taking that data and filtering it for certain positions...
Here's my code:
import urllib2
from bs4 import BeautifulSoup
import csv
url = ('http://nflcombineresults.com/nflcombinedata.php?year=2000&pos=&college=')
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
table = soup.find('table')
f = csv.writer(open("2000scrape.csv", "w"))
f.writerow(["Name", "Position", "Height", "Weight", "40-yd", "Bench", "Vertical", "Broad", "Shuttle", "3-Cone"])
# variable to check length of rows
x = (len(table.findAll('tr')) - 1)
# set to run through x
for row in table.findAll('tr')[1:x]:
    col = row.findAll('td')
    name = col[1].getText()
    position = col[3].getText()
    height = col[4].getText()
    weight = col[5].getText()
    forty = col[7].getText()
    bench = col[8].getText()
    vertical = col[9].getText()
    broad = col[10].getText()
    shuttle = col[11].getText()
    threecone = col[12].getText()
    player = (name, position, height, weight, forty, bench, vertical, broad, shuttle, threecone)
    f.writerow(player)
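For that position filter, one possible approach (just a sketch reusing the same table, f, and x variables above; "WR" and "CB" are only example positions) is to check the position cell before writing the row:
wanted = ("WR", "CB")  # example positions to keep
for row in table.findAll('tr')[1:x]:
    col = row.findAll('td')
    if col[3].getText() in wanted:
        f.writerow([col[i].getText() for i in (1, 3, 4, 5, 7, 8, 9, 10, 11, 12)])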
Upvotes: 12
Reputation: 76194
I can't run your script due to firewall permissions, but I believe the problem is on this line:
col = row.findAll('tr')
row is a tr tag, and you're asking BeautifulSoup to find all tr tags within that tr tag. You probably meant to do:
col = row.findAll('td')
Furthermore, since the actual text isn't directly inside the tds but is also hidden within nested div and a tags, it may be useful to use the getText method instead of .string:
name = col[1].getText()
position = col[3].getText()
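To see the difference, here's a tiny sketch with made-up markup that imitates a nested cell (not the site's actual HTML):
from bs4 import BeautifulSoup

cell = BeautifulSoup('<td> <div><a href="#">John Smith</a></div> </td>', 'html.parser').td
print cell.string               # None: the td has several children, not a single string
print cell.getText(strip=True)  # 'John Smith': getText gathers text from the nested tags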
Upvotes: 2