nokvk

Reputation: 353

Beautifulsoup: extracting td list in table

I'm stuck with a BeautifulSoup problem that I think is simple but I can't seem to solve. It is about extracting each td from the following table to create a loop and a list:

<table class="tabla-clasificacion-home marratua tablageneral tabla-actas">
<thead>
<tr>
<th scope="col">Team</th>
<th scope="col">Name</th>
<th scope="col">Number</th>
<th scope="col">Tipo</th>
<th scope="col">Motivo</th>
<th scope="col">Minute</th>
<th scope="col">Bloque</th>
</tr>
</thead>
<tbody>
<tr>
<td>Barcelona</td>
<td>Player 1</td>
<td>16</td>
<td>Tarjeta Amarilla</td>
<td>Derribar a un contrario en la disputa del balón</td>
<td>88</td>
<td>Segundo tiempo</td>
</tr> <tr>
<td>Real Madrid</td>
<td>Player 2</td>
<td>8</td>
<td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario impidiendo su avance.</td>
<td>12</td>
<td>Primer tiempo</td>
</tr>
</tbody>
</table>

What I need is to create a dictionary with some elements of each tr to create a dataframe later. I would like to have a list with:

As you can see, there are some tds that I don't need, and I'd like to skip them in my final df.

I've tried this code (a simplified example), but it doesn't work because I always get the team name in both fields:

tabla = amonestaciones.find('table', class_='tabla-clasificacion-home marratua tablageneral tabla-actas')

rows = tabla.find_all('tr')

for row in rows:
    team = row.find('td')
    name = row.findNext('td')
    lista = {
        "Team": team,
        "Name": name
    }

This is the output I get (I would also like to strip the HTML tags, but if I try .text or .get_text() I get the error 'NoneType' object has no attribute 'text'):

{'Team': <td>Real Madrid</td>, 'Name': <td>Real Madrid</td>}

I sense that I'm very close to the solution but I am stuck and I can't move forward. Thanks in advance for your help!

Upvotes: 1

Views: 1513

Answers (2)

baduker

Reputation: 20042

If you feel like learning something new, you don't even need bs4 (well, sort of). All you need is pandas (you get a dataframe out of the box) to get this:

-  -----------  --------  --  ----------------  -----------------------------------------------  --  --------------
0  Barcelona    Player 1  16  Tarjeta Amarilla  Derribar a un contrario en la disputa del balón  88  Segundo tiempo
1  Real Madrid  Player 2   8  Tarjeta Amarilla  Sujetar a un adversario impidiendo su avance.    12  Primer tiempo
-  -----------  --------  --  ----------------  -----------------------------------------------  --  --------------

With this:

from io import StringIO

import pandas as pd
from tabulate import tabulate

sample_html = """
<table class="tabla-clasificacion-home marratua tablageneral tabla-actas">
<thead>
<tr>
<th scope="col">Team</th>
<th scope="col">Name</th>
<th scope="col">Number</th>
<th scope="col">Tipo</th>
<th scope="col">Motivo</th>
<th scope="col">Minute</th>
<th scope="col">Bloque</th>
</tr>
</thead>
<tbody>
<tr>
<td>Barcelona</td>
<td>Player 1</td>
<td>16</td>
<td>Tarjeta Amarilla</td>
<td>Derribar a un contrario en la disputa del balón</td>
<td>88</td>
<td>Segundo tiempo</td>
</tr> <tr>
<td>Real Madrid</td>
<td>Player 2</td>
<td>8</td>
<td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario impidiendo su avance.</td>
<td>12</td>
<td>Primer tiempo</td>
</tr>
</tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
# Recent pandas versions deprecate passing literal HTML, so wrap it in StringIO.
dfs = pd.read_html(StringIO(sample_html), flavor="bs4")
df = pd.concat(dfs)
print(tabulate(df))
df.to_csv("your_table.csv", index=False)

The code also dumps your table to a .csv file (your_table.csv).
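Since the question asks to skip some of the columns, here is a rough sketch of how that could look with the pandas approach. It assumes the header names from the question's table, uses a shortened copy of the HTML, and leaves flavor at its default, so treat it as an illustration rather than the definitive way to do it:

```python
from io import StringIO

import pandas as pd

# Shortened copy of the question's table (assumption: same headers and order).
html = """
<table>
<thead><tr><th>Team</th><th>Name</th><th>Number</th><th>Tipo</th>
<th>Motivo</th><th>Minute</th><th>Bloque</th></tr></thead>
<tbody>
<tr><td>Barcelona</td><td>Player 1</td><td>16</td><td>Tarjeta Amarilla</td>
<td>Derribar a un contrario</td><td>88</td><td>Segundo tiempo</td></tr>
<tr><td>Real Madrid</td><td>Player 2</td><td>8</td><td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario</td><td>12</td><td>Primer tiempo</td></tr>
</tbody>
</table>
"""

# read_html returns one DataFrame per table; take the first one.
full = pd.read_html(StringIO(html))[0]

# Keep only the columns of interest; the others are simply not selected.
df = full[["Team", "Name", "Number", "Minute"]]
print(df)
```

Selecting by header name rather than by position keeps working if the site reorders the columns, as long as the names stay the same.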

Upvotes: 2

Amal K

Reputation: 4874

First retrieve all the rows, then call row.find_all('td') on each row and store the result in a variable. You can then index into it to pick the columns you need and append them to a list as a dictionary:

table = soup.find('table', class_='tabla-clasificacion-home marratua tablageneral tabla-actas')
tbody = table.find('tbody')
rows = tbody.find_all('tr')  # find_all is the current name for findAll
lista = []
for row in rows:
    td = row.find_all('td')
    # Keep only the columns of interest; indices follow the header order
    lista.append({
        'Team': td[0].text,
        'Name': td[1].text,
        'Number': td[2].text,
        'Minute': td[5].text,
    })
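For context on why the loop in the question misbehaves: the header row contains only th cells, so row.find('td') returns None there (hence the 'NoneType' error with .text), and findNext('td') searches forward from the row tag itself, so it lands on the row's first td again. The dictionary is also overwritten on every iteration, which is why only the last row shows up. Below is a minimal self-contained sketch of the per-row approach; the shortened HTML, the single class name in the lookup, and the `wanted` mapping are assumptions made for illustration:

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table (assumption: same column order).
html = """
<table class="tabla-actas">
<thead><tr><th>Team</th><th>Name</th><th>Number</th><th>Tipo</th>
<th>Motivo</th><th>Minute</th><th>Bloque</th></tr></thead>
<tbody>
<tr><td>Barcelona</td><td>Player 1</td><td>16</td><td>Tarjeta Amarilla</td>
<td>Derribar a un contrario</td><td>88</td><td>Segundo tiempo</td></tr>
<tr><td>Real Madrid</td><td>Player 2</td><td>8</td><td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario</td><td>12</td><td>Primer tiempo</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="tabla-actas")

# Column index -> dictionary key; columns not listed here are skipped.
wanted = {0: "Team", 1: "Name", 2: "Number", 5: "Minute"}

lista = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        continue  # the header row has only <th> cells
    # get_text(strip=True) returns the cell text without the surrounding tags
    lista.append({key: cells[i].get_text(strip=True) for i, key in wanted.items()})

print(lista)
```

The resulting list of dictionaries can be passed straight to pd.DataFrame(lista) if a dataframe is the end goal.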



Upvotes: 2
