e9e9s

Reputation: 955

Scraping an html table with beautiful soup into pandas

I'm trying to scrape an html table using beautiful soup and import it into pandas -- http://www.baseball-reference.com/teams/NYM/2017.shtml -- the "Team Batting" table.

Finding the table is no problem:

table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')

Finding the rows of data isn't a problem either:

for i in table.findAll('tr')[2]: #increase to 3 to get next row in table...
    print(i.get_text())

And I can even find the header names:

table_head = table.find('thead')

for i in table_head.findAll('th'):
    print(i.get_text())

Now I'm having trouble putting everything together into a data frame. Here's what I have so far:

from collections import OrderedDict
import pandas as pd

header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

row = []
for tr in table.findAll('tr')[2]:
    value = tr.get_text()
    row.append(value)

od = OrderedDict(zip(header, row))
df = pd.DataFrame(od, index=[0])

This only works for one row at a time. My question is how can I do this for every row in the table at the same time?

Upvotes: 2

Views: 6185

Answers (2)

Abdou

Reputation: 13274

This solution uses only pandas, but it cheats a little by knowing in advance that the team batting table is the tenth table on the page. With that knowledge, the following calls pandas' read_html function and grabs the tenth DataFrame from the list it returns. The rest is just data cleaning:

import pandas as pd


url = 'http://www.baseball-reference.com/teams/NYM/2017.shtml'

# Take 10th dataframe
team_batting = pd.read_html(url)[9]

# Take columns whose names don't contain "Unnamed"
team_batting.drop([x for x in team_batting.columns if 'Unnamed' in x], axis=1, inplace=True)

# Remove the rows that are just a copy of the headers/columns
team_batting = team_batting.loc[team_batting.apply(lambda x: x != team_batting.columns, axis=1).all(axis=1), :]

# Take out the Totals rows
team_batting = team_batting.loc[~team_batting.Rk.isnull(), :]

# Get a glimpse of the data
print(team_batting.head(5))

#   Rk Pos               Name Age   G  PA  AB   R   H 2B ...   OBP   SLG   OPS OPS+  TB GDP HBP SH SF IBB
# 0  1   C    Travis d'Arnaud  28  12  42  37   6  10  2 ...  .357  .541  .898  144  20   1   1  0  0   1
# 1  2  1B        Lucas Duda*  31  13  50  42   4  10  2 ...  .360  .571  .931  153  24   1   0  0  0   2
# 2  3  2B       Neil Walker#  31  14  62  54   5  12  3 ...  .306  .278  .584   64  15   2   0  0  1   0
# 3  4  SS  Asdrubal Cabrera#  31  15  67  63  10  17  2 ...  .313  .397  .710   96  25   0   0  0  0   0
# 4  5  3B        Jose Reyes#  34  15  59  53   3   5  2 ...  .186  .132  .319   -9   7   0   0  0  0   0

I hope this helps.
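If hard-coding the table's position feels brittle, read_html also accepts a match argument that keeps only tables whose text matches a regex, so the code no longer depends on the table being tenth on the page. A minimal sketch using an inline HTML snippet (the table content here is illustrative, not the real Baseball-Reference data):

```python
from io import StringIO

import pandas as pd

# Two toy tables; only the second contains the text "Name".
html = """
<table><tr><th>Rk</th><th>W</th></tr><tr><td>1</td><td>10</td></tr></table>
<table>
  <tr><th>Rk</th><th>Name</th><th>PA</th></tr>
  <tr><td>1</td><td>Travis d'Arnaud</td><td>42</td></tr>
  <tr><td>2</td><td>Lucas Duda</td><td>50</td></tr>
</table>
"""

# match= filters the parsed tables by a regex on their text,
# so only the table we actually want comes back.
tables = pd.read_html(StringIO(html), match='Name')
team_batting = tables[0]
print(team_batting)
```

With a live URL you would pass something like match='Team Batting' (assuming that text appears in or near the table) instead of indexing with [9].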

Upvotes: 0

mechanical_meat

Reputation: 169284

I have tested that the code below works for your purposes. Basically you need to build a list of dictionaries by looping over the players, then use that list to populate a DataFrame. It is advisable not to create the DataFrame row by row, as that will probably be significantly slower.

import collections as co
import pandas as pd

from bs4 import BeautifulSoup

with open('team_batting.html','r') as fin:
    soup = BeautifulSoup(fin.read(),'lxml')

table = soup.find('div', attrs={'class': 'overthrow table_container'})
table_body = table.find('tbody')

table_head = table.find('thead')
header = []    
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)

# loop over table to find number of rows with '' in first column
endrows = 0
for tr in table.findAll('tr'):
    if tr.findAll('th')[0].get_text() == '':
        endrows += 1

rows = len(table.findAll('tr'))
rows -= endrows + 1 # there is a pernicious final row that begins with 'Rk' 

list_of_dicts = []
for row in range(rows):
    the_row = []
    try:
        table_row = table.findAll('tr')[row]
        for cell in table_row:
            value = cell.get_text()
            the_row.append(value)
        od = co.OrderedDict(zip(header,the_row))
        list_of_dicts.append(od)
    except AttributeError:
        continue 

df = pd.DataFrame(list_of_dicts)
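The core idea above — accumulate a list of dicts, then construct the DataFrame in a single call — can be seen in isolation (the player values here are made up for illustration):

```python
import pandas as pd

# Toy rows standing in for the scraped values; collect the whole list first...
list_of_dicts = [
    {'Rk': '1', 'Name': 'Player A', 'PA': '42'},
    {'Rk': '2', 'Name': 'Player B', 'PA': '50'},
]

# ...then build the DataFrame once, instead of appending one row at a time.
df = pd.DataFrame(list_of_dicts)
print(df.shape)  # (2, 3)
```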

Upvotes: 1
