duck00

Reputation: 71

Dealing with links inside table cells in Beautiful Soup

I'm following an online tutorial, but as usual I've gone off-piste and am trying to apply the lessons learned to my own project. All is going surprisingly well, but I've hit a problem and haven't yet been able to find a solution.

https://pastebin.com/x4NjjTij

There are two problems with this (I mean, I'm sure you can find many more than two...):

In any cell that contains a hyperlink, the data is replaced with "None". For example, this:

<tr>
 <td>192</td>
 <td>
     <a href="/mlb/player/4987">Júlio Soto</a>
 </td>
 <td>26</td>
 <td>2B</td>
 <td>
     <a href="/mlb/team/13">KC</a>
 </td>
 <td>108</td>
 <td>115</td>
</tr>

Gets output as:

192  None  26  2B  None  108  115

Weirder still, many of the headers also contain hyperlinks, yet they come through without an issue. This header row:

<tr>
    <th>#</th>
    <th>Name</th>
    <th><a href="/mlb/playerstats[...]">AGE</a></th>
    <th>Pos</th>
    <th>Team</th>
    <th><a href="/mlb/playerstats[...]">AB</a></th>
</tr>

outputs everything just fine. While typing this I noticed that the linked cells in the first code block have newlines around the link, whereas the headers in the second do not. Is that the deal-breaker here? Would I need to remove all newlines when scraping the data? If so, how?

The second problem, which may well be related now that I've spotted the newline issue, is that one column header shows as "None". It seems to be because that header also includes a span with a down arrow: it's the column the website's data is currently sorted by, and the arrow marks it as the sort column.

<th>
    <a href="/mlb/playerstats[...]">PA</a>
   
     <span aria-hidden="true" class="glyphicon glyphicon-chevron-down"></span>
</th>

This looks like the same issue, or a similar one. Do I just need to get rid of those lines when reading the data in, and how would I go about doing that? A simple df.rename(columns = {'None':'PA'}, inplace=True) works for now, but I'd like to know how to do it 'correctly'.

Upvotes: 0

Views: 59

Answers (1)

Feline

Reputation: 784

You can try the following, which splits each row's text on newlines and then drops the blank columns that the split leaves behind:

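import pandas as pd

# Assumes `rows` is the list of <tr> tags already collected in the question's code.

# Bring in the data with associated warts: one list of text fragments per row,
# skipping the header row and the last row
data = []
for tr in rows[1:-1]:
    data.append(str(tr.text).strip().split("\n"))

# Dump it in a dataframe
df = pd.DataFrame(data)

# Forcefully remove the blank columns left behind by the newline split
# (these positions are specific to this page's layout)
df.drop(
    axis=1,
    labels=[1, 3, 6, 8, 28],
    inplace=True
)

# Add the headings
headings = []
for th in rows[0].find_all("th"):
    headings.append(th.text.replace('\n', '').strip())
df.columns = headings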
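
As for why the "None" values appear: I can't see the pastebin code from here, so this is a guess, but it is the behaviour you would get if each cell is read with `.string`. In Beautiful Soup, `.string` returns None when a tag has more than one child, and a cell with newlines around the link has three (whitespace, the `<a>` tag, whitespace). The headers with no whitespace around the link have a single child, so `.string` works there. A minimal sketch, using made-up sample markup modelled on your snippets (the `#` hrefs are placeholders), that reads cells with `get_text(strip=True)` instead:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the snippets in the question.
html = """
<table>
<tr>
    <th>#</th>
    <th>Name</th>
    <th><a href="#">AGE</a></th>
    <th>
        <a href="#">PA</a>

        <span aria-hidden="true" class="glyphicon glyphicon-chevron-down"></span>
    </th>
</tr>
<tr>
    <td>192</td>
    <td>
        <a href="/mlb/player/4987">Júlio Soto</a>
    </td>
    <td>26</td>
    <td>108</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# The name cell has three children (whitespace, <a>, whitespace),
# so .string gives None, while get_text() collects the nested text.
name_cell = rows[1].find_all("td")[1]
string_value = name_cell.string
text_value = name_cell.get_text(strip=True)

# A header with no whitespace around its link has a single child chain,
# which is why .string still works for it.
age_header = rows[0].find_all("th")[2].string

# Read every header and cell the same way, whatever it contains.
headings = [th.get_text(strip=True) for th in rows[0].find_all("th")]
data = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in rows[1:]]
```

If that assumption holds for your code, `data` and `headings` built this way could go straight into `pd.DataFrame(data, columns=headings)`, with no hard-coded column positions to drop and no rename needed for the PA header.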

Upvotes: 1
