Jensen Holm

Parsing table data from BeautifulSoup HTML Comment

I am trying to scrape a table from https://www.baseball-reference.com/register/team.cgi?id=9995d2a1, specifically the one labeled "Team Pitching". It is hidden inside an HTML comment, which prevents me from using pd.read_html() directly or another simpler method. I have gotten to the point where I have all of the data in a DataFrame, but players who have an asterisk after their name (because they are left-handed) disappear: their names come out as 'None'. What I really need is to remove the '*' so that the name reads correctly.

This is what I did to get what I have so far, with 'None' as the name for lefties:

import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

page = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/team.cgi?id=b0a9f9bc').text, features='lxml')

# collect the first <table> found inside each HTML comment
tbls = []
for comment in page.find_all(string=lambda text: isinstance(text, Comment)):
    if "<table " in comment:
        comment_soup = BeautifulSoup(comment, 'lxml')
        table = comment_soup.find("table")
        tbls.append(table)

def parse_row(row):
    # this is where left-handed pitchers come back as 'None'
    return [str(x.string) for x in row.find_all('td')]

# pitching table
pitching_tbl = tbls[0]

# html text only used for finding names
html = BeautifulSoup(pitching_tbl.text, features='lxml')

rows = pitching_tbl.find_all('tr')
data = pd.DataFrame([parse_row(row) for row in rows])

What I would like to do is loop through the text within pitching_tbl, strip any asterisk with .replace('*', ''), and have the actual HTML within pitching_tbl change in place.

Any help is appreciated!

Answers (2)

chitown88

So to get that, you need to change:

def parse_row(row):
  return [str(x.string) for x in row.find_all('td')]

to

def parse_row(row):
  return [str(x.text) for x in row.find_all('td')]

The reason you get None is that the '*' is not part of the <a> tag, so the <td> element has two children: the <a> tag and the bare '*' string. .string only returns a value when a tag has a single child, whereas .text joins the text of all the children.
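A small self-contained illustration of that difference, using made-up markup that mimics the site's name cells (the href values are placeholders, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical row: the first cell has an <a> tag plus a bare '*', the second only an <a> tag
html = (
    '<table><tr>'
    '<td><a href="/p1">Cylis Cox</a>*</td>'
    '<td><a href="/p2">Dylan Freeman</a></td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# .string gives up (returns None) when a tag has more than one child
string_values = [td.string for td in cells]

# .get_text() (same result as .text) concatenates the text of every child
text_values = [td.get_text() for td in cells]

print(string_values)
print(text_values)
```

Only the cell with the extra '*' child loses its value under .string; the plain cell is unaffected, which matches the symptom in the question.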

So that handles the first issue of the None. The second issue is removing the '*'. I'm not sure whether you want it removed from the HTML itself or just from the DataFrame you create, so I will show you both.

To change the actual html:

Here we just remove the '*' element from the <td>'s .contents list. This alters the actual soup object, changing the HTML, so your DataFrame will then show the names without it too.
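One way this could look is sketched below. It is a minimal sketch on made-up markup, not necessarily the exact code used for the screenshots; it assumes the marker is always a bare '*' text node sitting directly inside the <td>:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for pitching_tbl
html = (
    '<table><tr>'
    '<td><a href="/p1">Cylis Cox</a>*</td>'
    '<td><a href="/p2">Dylan Freeman</a></td>'
    '</tr></table>'
)
pitching_tbl = BeautifulSoup(html, "html.parser")

for td in pitching_tbl.find_all("td"):
    # iterate over a copy, since extract() mutates td.contents
    for child in list(td.contents):
        if isinstance(child, NavigableString) and child.strip() == "*":
            child.extract()  # removes the node from the tree in place

first_cell = pitching_tbl.find("td")
print(first_cell)
```

Once the stray text node is gone, each name cell has a single child again, so the original parse_row based on .string would also return the name instead of None.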

[Screenshots: the HTML before, and the HTML after, where the '*' is no longer in the actual markup.]

Now, if you're not interested in changing the HTML and would rather just grab the data and manipulate it with pandas, strip the '*' as the last step, once you have data (or df in the other solution). NOTE: like F.Hoque did, I would let pandas' .read_html() parse the table for you, as it'll grab the headers as well, but this will still work with your code too. For his, you'd do df['Name'] = df['Name'].str.replace('*', '', regex=False). Use regex=False because '*' on its own is a regex quantifier rather than a valid pattern:

import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd


page = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/team.cgi?id=b0a9f9bc').text, features='lxml')

# collect the first <table> found inside each HTML comment
tbls = []
for comment in page.find_all(string=lambda text: isinstance(text, Comment)):
    if "<table " in comment:
        comment_soup = BeautifulSoup(comment, 'lxml')
        table = comment_soup.find("table")
        tbls.append(table)

def parse_row(row):
    # .text joins every child of the <td>, so lefties keep their name (plus the '*')
    return [str(x.text) for x in row.find_all('td')]


# pitching table
pitching_tbl = tbls[0]

rows = pitching_tbl.find_all('tr')
data = pd.DataFrame([parse_row(row) for row in rows])
# literal replace: '*' alone is not a valid regular expression
data[0] = data[0].str.replace('*', '', regex=False)
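As a quick check of the asterisk-stripping step on its own, here is a minimal sketch with a hand-made Series (the names are sample values, not fetched from the site). It shows two options: a literal replacement, and an escaped pattern that only removes a trailing '*':

```python
import pandas as pd

names = pd.Series(["Cylis Cox*", "Dylan Freeman", "Zach Hopman*"])

# Option 1: treat '*' as a plain character
literal = names.str.replace("*", "", regex=False)

# Option 2: escaped regex, anchored so only a trailing '*' is removed
anchored = names.str.replace(r"\*$", "", regex=True)

print(literal.tolist())
print(anchored.tolist())
```

For this table the two give the same result; the anchored form would only matter if an asterisk could appear elsewhere in a name.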

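Using F.Hoque's solution, which is how I would do it. The asterisk also tells you the pitcher's handedness, which might be a nice extra column, so if it's there, why not add it?

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment
from io import StringIO
import numpy as np

url='https://www.baseball-reference.com/register/team.cgi?id=9995d2a1'
req=requests.get(url)
soup=BeautifulSoup(req.text,'lxml')
# the pitching table lives inside the comment that contains id="div_team_pitching"
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
pitching_html = [x for x in comments if 'id="div_team_pitching"' in x][0]
# StringIO: newer pandas versions deprecate passing raw HTML strings to read_html
df = pd.read_html(StringIO(pitching_html))[0]

# flag lefties before stripping the marker; regex=False matches a literal '*'
df['Handedness'] = np.where(df['Name'].str.contains('*', regex=False), 'L', 'R')

df['Name'] = df['Name'].str.replace('*', '', regex=False)
print(df)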

Output:

      Rk                      Name   Age  W  ...   SO9  SO/W  Notes  Handedness
0    1.0  Logan Bursick-Harrington  21.0  0  ...  15.8  1.00    NaN           R
1    2.0                 Cylis Cox  19.0  1  ...  11.6  1.50    NaN           L
2    3.0           Travis Densmore  21.0  0  ...  14.4  8.00    NaN           L
3    4.0             Dylan Freeman  22.0  1  ...  14.6  4.33    NaN           R
4    5.0               Zach Hopman  22.0  0  ...  11.4  1.14    NaN           L
5    6.0            Eamon Horwedel  22.0  1  ...   6.4  1.00    NaN           R
6    7.0             Tyler Johnson  19.0  0  ...  10.8  4.00    NaN           R
7    8.0               Trent Jones  20.0  0  ...  12.4  5.50    NaN           R
8    9.0              Tanner Knapp  21.0  1  ...   4.8  0.63    NaN           R
9   10.0              Mason Majors  22.0  1  ...  12.3  1.67    NaN           R
10  11.0               Mason Meeks  21.0  0  ...   5.4  1.50    NaN           R
11  12.0            Sam Nagelvoort  19.0  0  ...   9.0  0.40    NaN           R
12  13.0              Tyler Nichol  20.0  0  ...   0.0  0.00    NaN           R
13  14.0                Cole Russo  19.0  0  ...   0.0   NaN    NaN           R
14  15.0               Kyle Salley  22.0  0  ...   9.0  0.40    NaN           L
15  16.0               Noah Stants  21.0  0  ...  11.4  1.60    NaN           R
16  17.0          Quinn Waterhouse  21.0  0  ...  18.0  4.00    NaN           L
17  18.0              Nick Weyrich  19.0  0  ...  11.6  1.50    NaN           R
18  19.0              Adam Wheaton  23.0  0  ...  12.6  2.80    NaN           R
19   NaN                19 Players  20.9  5  ...  10.7  1.55    NaN           R

[20 rows x 33 columns]
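Note that the totals row ("19 Players") is labeled 'R' in the output above simply because it has no asterisk. If that matters, one possible refinement is to blank the flag wherever Rk is missing. This is a sketch on a hand-made frame; it assumes, as the output suggests, that only the totals row has a NaN in Rk:

```python
import numpy as np
import pandas as pd

# Hand-made sample shaped like the scraped table (values are illustrative)
df = pd.DataFrame({
    "Rk": [1.0, 2.0, np.nan],
    "Name": ["Cylis Cox*", "Dylan Freeman", "2 Players"],
})

# Flag lefties before removing the marker
is_lefty = df["Name"].str.endswith("*")
df["Handedness"] = np.where(is_lefty, "L", "R")
df["Name"] = df["Name"].str.rstrip("*")

# Assumption: a missing Rk identifies the totals row, which has no handedness
df.loc[df["Rk"].isna(), "Handedness"] = np.nan

print(df)
```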

Md. Fazlul Hoque

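The desired table is inside an HTML comment, so you can use BeautifulSoup's Comment class with a lambda function to find that comment and pass its contents to pandas.

import pandas as pd
import requests
from io import StringIO
from bs4 import BeautifulSoup
from bs4 import Comment

url='https://www.baseball-reference.com/register/team.cgi?id=9995d2a1'
req=requests.get(url)
soup=BeautifulSoup(req.text,'lxml')
# pick the comment holding the pitching table, then let pandas parse it
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
pitching_html = [x for x in comments if 'id="div_team_pitching"' in x][0]
# StringIO: newer pandas versions deprecate passing raw HTML strings to read_html
df = pd.read_html(StringIO(pitching_html))[0]
print(df)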

Output:

      Rk                      Name   Age  W  L   W-L%  ...    H9   HR9   BB9   SO9  SO/W  Notes
0    1.0  Logan Bursick-Harrington  21.0  0  2  0.000  ...   4.5   0.0  15.8  15.8  1.00    NaN
1    2.0                Cylis Cox*  19.0  1  0  1.000  ...  23.1   0.0   7.7  11.6  1.50    NaN
2    3.0          Travis Densmore*  21.0  0  1  0.000  ...   7.2   0.0   1.8  14.4  8.00    NaN
3    4.0             Dylan Freeman  22.0  1  0  1.000  ...  13.5   1.1   3.4  14.6  4.33    NaN
4    5.0              Zach Hopman*  22.0  0  1  0.000  ...  12.8   0.0   9.9  11.4  1.14    NaN
5    6.0            Eamon Horwedel  22.0  1  0  1.000  ...   9.0   0.0   6.4   6.4  1.00    NaN
6    7.0             Tyler Johnson  19.0  0  0    NaN  ...   5.4   0.0   2.7  10.8  4.00    NaN
7    8.0               Trent Jones  20.0  0  0    NaN  ...  14.6   1.1   2.3  12.4  5.50    NaN
8    9.0              Tanner Knapp  21.0  1  1  0.500  ...  11.6   0.0   7.7   4.8  0.63    NaN
9   10.0              Mason Majors  22.0  1  0  1.000  ...   4.9   0.0   7.4  12.3  1.67    NaN
10  11.0               Mason Meeks  21.0  0  1  0.000  ...   6.3   0.9   3.6   5.4  1.50    NaN
11  12.0            Sam Nagelvoort  19.0  0  1  0.000  ...  18.0   2.3  22.5   9.0  0.40    NaN
12  13.0              Tyler Nichol  20.0  0  0    NaN  ...  27.0   0.0  27.0   0.0  0.00    NaN
13  14.0                Cole Russo  19.0  0  0    NaN  ...  27.0  13.5   0.0   0.0   NaN    NaN
14  15.0              Kyle Salley*  22.0  0  1  0.000  ...   9.0   2.3  22.5   9.0  0.40    NaN
15  16.0               Noah Stants  21.0  0  0    NaN  ...   4.3   1.4   7.1  11.4  1.60    NaN
16  17.0         Quinn Waterhouse*  21.0  0  0    NaN  ...   4.5   0.0   4.5  18.0  4.00    NaN
17  18.0              Nick Weyrich  19.0  0  0    NaN  ...   6.4   1.3   7.7  11.6  1.50    NaN
18  19.0              Adam Wheaton  23.0  0  1  0.000  ...  11.7   1.8   4.5  12.6  2.80    NaN
19   NaN                19 Players  20.9  5  9  0.357  ...   9.2   0.8   6.9  10.7  1.55    NaN

[20 rows x 32 columns]
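To see the comment lookup in isolation, without requesting the page, here is a minimal offline sketch. The markup is invented for illustration; only the id="div_team_pitching" marker comes from the code above, and the rows are re-parsed with BeautifulSoup rather than pandas to keep the example dependency-light:

```python
from bs4 import BeautifulSoup, Comment

# Invented page fragment: the table is wrapped in an HTML comment
page_html = """
<div id="wrapper">
<!--
<div id="div_team_pitching">
<table>
<tr><th>Name</th><th>W</th></tr>
<tr><td><a href="/p1">Cylis Cox</a>*</td><td>1</td></tr>
</table>
</div>
-->
</div>
"""

soup = BeautifulSoup(page_html, "html.parser")

# Comments are parsed as Comment objects, a subclass of str
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
hidden = next(c for c in comments if 'id="div_team_pitching"' in c)

# The comment body is just text, so it has to be parsed a second time
inner = BeautifulSoup(hidden, "html.parser")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in inner.find_all("tr")
]
print(rows)
```

The first parse never sees a <table> tag, which is why pd.read_html() on the raw page does not find this table.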
