Dealing with links inside table cells in Beautiful Soup

Question

I'm following an online tutorial, but as usual I've gone off-piste and am trying to apply the lessons learned to my own project. All is going surprisingly well, but I've hit a problem that I haven't yet been able to solve.

https://pastebin.com/x4NjjTij

There are two problems with this (I mean, I'm sure you can find many more than two...):

In any cell that has a hyperlink in it, the data is replaced with "None". For example, this:


 192
 
     Júlio Soto
 
 26
 2B
 
     KC
 
 108
 115

Gets output as:

192  None  26  2B  None  108  115

The weirder thing about this is that many of the headers also have hyperlinks but they work without an issue.


    #
    Name
    AGE
    Pos
    Team
    AB

outputs everything just fine. While typing this, I've just noticed that the cells in the first code block contain newlines and the ones in the second do not. Is that the deal-breaker here? Would I need to remove all newlines when scraping the data? If so, how?

The second problem, which may well be related to the newline issue above, is that one column header is showing as "None". It seems to be because the header also includes a span with a down arrow: it's the column the website's data is currently sorted by, so it has the arrow to signify the sort order.

PA

This seems to be the same or a similar issue. Do I just need to get rid of the line when reading the data in? How would I go about doing this? A simple df.rename(columns={'None': 'PA'}, inplace=True) works for now, but I'd like to know how to do it 'correctly'.
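
Both symptoms are consistent with reading each cell through Beautiful Soup's `.string` attribute, which returns None whenever a tag has more than one child. That is an assumption, since the pastebin code is not reproduced here. The minimal sketch below uses made-up markup modelled on the snippets above (the `href` values and the `sorted` class are hypothetical) to show the difference between `.string` and `get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's snippets.
html = """<table><tr>
<th><a href="#">Name</a></th>
<th><a href="#">PA</a><span class="sorted"></span></th>
<td>
    <a href="#">KC</a>
</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
name_th, pa_th = soup.find_all("th")
team_td = soup.find("td")

print(name_th.string)  # one child (<a>), so .string descends into it
print(pa_th.string)    # two children (<a> and <span>), so None
print(team_td.string)  # three children (newline, <a>, newline), so None

# get_text() gathers the text of all descendants instead.
print(team_td.get_text(strip=True))
print(pa_th.get_text(strip=True))
```

If the arrow span contains a visible character rather than being empty, `get_text()` would include that character in the heading, so the span would need to be removed first.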

Feline · Accepted Answer

You can try doing the following:

import pandas as pd

# Bring in the data with associated warts
# (rows is assumed to be the list of <tr> tags from the question's code)
data = []
for tr in rows[1:-1]:
    data.append(tr.text.strip().split("\n"))

# Dump it in a dataframe
df = pd.DataFrame(data)

# Forcefully remove the blank data
df.drop(
    axis=1,
    labels=[1, 3, 6, 8, 28],
    inplace=True
    )

# Add the headings
headings = []
for th in rows[0].find_all("th"):
    headings.append(th.text.replace('\n', '').strip())
df.columns = headings
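
An alternative to splitting the row text and dropping blank columns by position is to read the table cell by cell, which may hold up better if the column layout changes. The sketch below is not the answerer's method. It assumes a table shaped like the one in the question, and the sample HTML, the `sorted` class, and the arrow character are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table modelled on the question's data.
html = """<table>
<tr>
 <th>#</th>
 <th><a href="#">Name</a></th>
 <th><a href="#">Team</a></th>
 <th><a href="#">PA</a><span class="sorted">▼</span></th>
</tr>
<tr>
 <td>192</td>
 <td>
     <a href="#">Júlio Soto</a>
 </td>
 <td>
     <a href="#">KC</a>
 </td>
 <td>115</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

headings = []
for th in rows[0].find_all("th"):
    # Remove the sort-arrow span so its text does not end up in the heading.
    for span in th.find_all("span"):
        span.decompose()
    headings.append(th.get_text(strip=True))

# One list per data row, one stripped string per cell.
data = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in rows[1:]]

print(headings)
print(data)
```

The result can then be passed to `pd.DataFrame(data, columns=headings)`, with no blank columns to drop afterwards. The sample table has no trailing row, so it slices `rows[1:]`; use `rows[1:-1]` as above if the real table ends with a row that should be skipped.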
