Reputation: 353
I'm stuck on a BeautifulSoup problem that I think is simple but can't seem to solve. I need to extract each td from the following table, looping over the rows to build a list:
<table class="tabla-clasificacion-home marratua tablageneral tabla-actas">
<thead>
<tr>
<th scope="col">Team</th>
<th scope="col">Name</th>
<th scope="col">Number</th>
<th scope="col">Tipo</th>
<th scope="col">Motivo</th>
<th scope="col">Minute</th>
<th scope="col">Bloque</th>
</tr>
</thead>
<tbody>
<tr>
<td>Barcelona</td>
<td>Player 1</td>
<td>16</td>
<td>Tarjeta Amarilla</td>
<td>Derribar a un contrario en la disputa del balón</td>
<td>88</td>
<td>Segundo tiempo</td>
</tr> <tr>
<td>Real Madrid</td>
<td>Player 2</td>
<td>8</td>
<td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario impidiendo su avance.</td>
<td>12</td>
<td>Primer tiempo</td>
</tr>
</tbody>
</table>
What I need is to create a dictionary with some elements of each tr to create a dataframe later. I would like to have a list with:
As you can see, there are some tds that I don't need, and I'd like to skip them in my final df.
I've tried this code (a simplified example), but it doesn't work because both fields always end up with the team name:
tabla = amonestaciones.find('table', class_='tabla-clasificacion-home marratua tablageneral tabla-actas')
rows = tabla.find_all('tr')
for row in rows:
    team = row.find('td')
    name = row.findNext('td')
    lista = {
        "Team": team,
        "Name": name
    }
This is the output I get (I would also like to strip the HTML tags, but if I try .text or .get_text() I get the error 'NoneType' object has no attribute 'text'):
{'Team': <td>Real Madrid</td>, 'Name': <td>Real Madrid</td>}
I sense that I'm very close to the solution but I am stuck and I can't move forward. Thanks in advance for your help!
Upvotes: 1
Views: 1513
Reputation: 20042
If you feel like learning something new, you don't even need bs4 (well, sort of). All you need is pandas (you get a dataframe out of the box) to get this:
- ----------- -------- -- ---------------- ----------------------------------------------- -- --------------
0 Barcelona Player 1 16 Tarjeta Amarilla Derribar a un contrario en la disputa del balón 88 Segundo tiempo
1 Real Madrid Player 2 8 Tarjeta Amarilla Sujetar a un adversario impidiendo su avance. 12 Primer tiempo
- ----------- -------- -- ---------------- ----------------------------------------------- -- --------------
With this:
import pandas as pd
from tabulate import tabulate
sample_html = """
<table class="tabla-clasificacion-home marratua tablageneral tabla-actas">
<thead>
<tr>
<th scope="col">Team</th>
<th scope="col">Name</th>
<th scope="col">Number</th>
<th scope="col">Tipo</th>
<th scope="col">Motivo</th>
<th scope="col">Minute</th>
<th scope="col">Bloque</th>
</tr>
</thead>
<tbody>
<tr>
<td>Barcelona</td>
<td>Player 1</td>
<td>16</td>
<td>Tarjeta Amarilla</td>
<td>Derribar a un contrario en la disputa del balón</td>
<td>88</td>
<td>Segundo tiempo</td>
</tr> <tr>
<td>Real Madrid</td>
<td>Player 2</td>
<td>8</td>
<td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario impidiendo su avance.</td>
<td>12</td>
<td>Primer tiempo</td>
</tr>
</tbody>
</table>
"""
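# Recent pandas versions deprecate passing a literal HTML string, so wrap it in StringIO.
from io import StringIO
tables = pd.read_html(StringIO(sample_html), flavor="bs4")
df = pd.concat(tables)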
print(tabulate(df))
df.to_csv("your_table.csv", index=False)
The code also dumps your table to a .csv file.
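The question also asks to skip some columns, which the snippet above does not do. A minimal sketch of one way to handle that, assuming the column names match the table header and that only Team, Name, Number and Minute are wanted (the same selection the other answer uses). The shortened `sample_html` and the names `wanted`, `df_small` and `records` are illustrative, not part of the original answer. Note that `flavor="bs4"` needs both bs4 and html5lib installed.

```python
from io import StringIO

import pandas as pd

# Compact copy of the question's table (same headers, same two rows).
sample_html = """
<table class="tabla-clasificacion-home marratua tablageneral tabla-actas">
<thead><tr>
<th>Team</th><th>Name</th><th>Number</th><th>Tipo</th>
<th>Motivo</th><th>Minute</th><th>Bloque</th>
</tr></thead>
<tbody>
<tr><td>Barcelona</td><td>Player 1</td><td>16</td><td>Tarjeta Amarilla</td>
<td>Derribar a un contrario en la disputa del balón</td><td>88</td><td>Segundo tiempo</td></tr>
<tr><td>Real Madrid</td><td>Player 2</td><td>8</td><td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario impidiendo su avance.</td><td>12</td><td>Primer tiempo</td></tr>
</tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
tables = pd.read_html(StringIO(sample_html), flavor="bs4")
df = tables[0]

# Assumed selection: keep only the columns needed for the final dataframe.
wanted = ["Team", "Name", "Number", "Minute"]
df_small = df[wanted]

# One dict per row, in case a list of dictionaries is still needed.
records = df_small.to_dict("records")
print(df_small)
```

Selecting by header name keeps the code readable and does not depend on column positions, but it will raise a KeyError if the site renames a header.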
Upvotes: 2
Reputation: 4874
You first have to retrieve all the rows, then call row.findAll('td') on each row and store the result in a variable. You can then index into it to retrieve the required columns and append them as a dictionary to a list:
table = soup.find('table', class_='tabla-clasificacion-home marratua tablageneral tabla-actas')
tbody = table.find('tbody')
tr = tbody.findAll('tr')
lista = []
for row in tr:
    td = row.findAll('td')
    lista.append({'Team': td[0].text, 'Name': td[1].text, 'Number': td[2].text, 'Minute': td[5].text})
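For reference, here is a self-contained sketch of the same approach that can be run as is. It also shows why the code in the question misbehaves: `row.findNext('td')` searches forward from the row tag itself, so it returns the row's first td again, and the header row contains only th cells, so `row.find('td')` returns None there (hence the 'NoneType' error). The `wanted` mapping and the use of `html.parser` are assumptions made for the example, not part of the original page.

```python
from bs4 import BeautifulSoup

# Compact copy of the question's table (same headers, same two rows).
html = """
<table class="tabla-clasificacion-home marratua tablageneral tabla-actas">
<thead><tr>
<th>Team</th><th>Name</th><th>Number</th><th>Tipo</th>
<th>Motivo</th><th>Minute</th><th>Bloque</th>
</tr></thead>
<tbody>
<tr><td>Barcelona</td><td>Player 1</td><td>16</td><td>Tarjeta Amarilla</td>
<td>Derribar a un contrario en la disputa del balón</td><td>88</td><td>Segundo tiempo</td></tr>
<tr><td>Real Madrid</td><td>Player 2</td><td>8</td><td>Tarjeta Amarilla</td>
<td>Sujetar a un adversario impidiendo su avance.</td><td>12</td><td>Primer tiempo</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Matching on one of the table's classes is enough to find it.
table = soup.find("table", class_="tabla-actas")

# Assumed mapping: output key -> position of the td inside each row.
wanted = {"Team": 0, "Name": 1, "Number": 2, "Minute": 5}

lista = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        # The header row holds only <th> cells, so there is nothing to read.
        continue
    # get_text(strip=True) returns the cell text without tags or padding.
    lista.append({key: cells[i].get_text(strip=True) for key, i in wanted.items()})

print(lista)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(lista)` to build the dataframe; all values are strings at that point, so numeric columns would still need converting.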
Upvotes: 2