Kyle Kramer

Reputation: 59

Turn an HTML table into a CSV file

How do I turn a table like this one (a batting gamelogs table) into a CSV file using Python and BeautifulSoup?

I want only the first header row (Rk, Gcar, Gtm, etc.), not the other header rows that repeat within the table (one for each month of the season).

Here is the code I have so far:

from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

def stir_the_soup():
    player_links = open('player_links.txt', 'r')
    player_ID_nums = open('player_ID_nums.txt', 'r')
    id_nums = [x.rstrip('\n') for x in player_ID_nums]
    idx = 0
    for url in player_links:
        print url
        soup = BeautifulSoup(urlopen(url), "lxml")
        p_type = ""
        if url[-12] == 'p':
            p_type = "pitching"
        elif url[-12] == 'b':
            p_type = "batting" 
        table = soup.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']== (p_type + "_gamelogs"))
        header = [[val.text.encode('utf8') for val in table.find_all('thead')]]
        rows = []
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf8') for val in row.find_all('th')])
            rows.append([val.text.encode('utf8') for val in row.find_all('td')])
        with open("%s.csv" % id_nums[idx], 'wb') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(row for row in rows if row)
        idx += 1
    player_links.close()

if __name__ == "__main__":
    stir_the_soup()

The id_nums list contains all of the id numbers for each player to use as the names for the separate CSV files.

In each row, the leftmost cell is a th tag and the rest of the cells are td tags. How do I put them together into one row, along with the header?

Upvotes: 0

Views: 3213

Answers (2)

MattR

Reputation: 5146

This code gets you the big table of stats, which is what I think you want. Make sure you have lxml, beautifulsoup4, and pandas installed.

import pandas as pd

# read_html returns a list with one DataFrame per table found on the page
df = pd.read_html(r'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010')
print(df[4])

Here is the output of the first 5 rows. You may need to clean it up slightly, as I don't know what your exact end goal is:

df[4].head(5)
    Rk  Gcar    Gtm Date    Tm  Unnamed: 5  Opp Rslt    Inngs   PA  ... CS  BA  OBP SLG OPS BOP aLI WPA RE24    Pos
0   1   66  2 (1)   Apr 6   ARI NaN SDP L,3-6   7-8 1   ... 0   1.000   1.000   1.000   2.000   9   .94 0.041   0.51    PH
1   2   67  3   Apr 7   ARI NaN SDP W,5-3   7-8 1   ... 0   .500    .500    .500    1.000   9   1.16    -0.062  -0.79   PH
2   3   68  4   Apr 9   ARI NaN PIT W,9-1   8-GF    1   ... 0   .667    .667    .667    1.333   2   .00 0.000   0.13    PH SS
3   4   69  5   Apr 10  ARI NaN PIT L,3-6   CG  4   ... 0   .500    .429    .500    .929    2   1.30    -0.040  -0.37   SS
4   5   70  7 (1)   Apr 13  ARI @   LAD L,5-9   6-6 1   ... 0   .429    .375    .429    .804    9   1.52    -0.034  -0.46   PH

To select a particular column within this DataFrame: df[4]['COLUMN_NAME_HERE'].head(5)

Example: df[4]['Gcar']

Also, if typing df[4] gets annoying, you can assign it to its own variable: df2 = df[4]
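
To go from the DataFrame to a CSV file without the in-table headers, something like the following might work. It is only a sketch in Python 3: it assumes the monthly header rows come through as ordinary rows that repeat the column names (so the Rk column contains the string "Rk"), and the small games frame and the drop_repeated_headers helper are made-up stand-ins for df[4], not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for df[4]: the third row repeats the header,
# the way an in-table month header is assumed to appear.
games = pd.DataFrame(
    {
        "Rk": ["1", "2", "Rk", "3"],
        "Gcar": ["66", "67", "Gcar", "68"],
        "Date": ["Apr 6", "Apr 7", "Date", "May 1"],
    }
)


def drop_repeated_headers(frame):
    """Return a copy without rows whose first cell repeats the first column name."""
    first_col = frame.columns[0]
    keep = frame[first_col].astype(str) != str(first_col)
    return frame[keep].reset_index(drop=True)


cleaned = drop_repeated_headers(games)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = cleaned.to_csv(index=False)
print(csv_text)
```

Passing a file name to to_csv instead (for example the player's id number plus ".csv") writes the file directly.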

Upvotes: 1

dashiell

Reputation: 812

import pandas as pd
from bs4 import BeautifulSoup
import urllib2

url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
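html = urllib2.urlopen(url)

bs = BeautifulSoup(html, 'lxml')

# Isolate the gamelog table by id and pass only that markup to pandas
table = str(bs.find('table', {'id': 'batting_gamelogs'}))

# read_html returns a list of DataFrames; the table is dfs[0]
dfs = pd.read_html(table)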

This uses pandas, which is well suited to this kind of task: read_html parses the table straight into a DataFrame, a convenient format for further operations such as writing it out with to_csv.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html
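
If you would rather stay with BeautifulSoup and the csv module, as in the question, here is a rough sketch of one way to take only the first header row and merge each row's th and td cells into one list. It is written for Python 3 and runs on a small inline sample, not the live page. SAMPLE_HTML, its class="thead" row, and the table_to_rows helper are all made up for illustration, and it assumes the repeated headers have exactly the same cell text as the first header row.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical miniature of the gamelog table, with one repeated header row.
SAMPLE_HTML = """
<table id="batting_gamelogs">
  <thead><tr><th>Rk</th><th>Gcar</th><th>Date</th></tr></thead>
  <tbody>
    <tr><th>1</th><td>66</td><td>Apr 6</td></tr>
    <tr class="thead"><th>Rk</th><th>Gcar</th><th>Date</th></tr>
    <tr><th>2</th><td>67</td><td>May 1</td></tr>
  </tbody>
</table>
"""


def table_to_rows(html, table_id):
    """Return (header, rows) for the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    # Only the first row of <thead> is used as the CSV header.
    header = [c.get_text(strip=True) for c in table.thead.find("tr").find_all("th")]
    rows = []
    for tr in table.tbody.find_all("tr"):
        # th and td cells are collected together, in document order.
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        # Skip empty rows and rows that just repeat the header.
        if cells and cells != header:
            rows.append(cells)
    return header, rows


header, rows = table_to_rows(SAMPLE_HTML, "batting_gamelogs")

# Write to an in-memory buffer here; a real file opened with newline="" works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a list of tag names to find_all is what keeps the leftmost th cell and the td cells in a single row, so there is no need to append them separately.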

Upvotes: 0
