Reputation: 198
My code so far looks like this:
from bs4 import BeautifulSoup
import csv

html = open("Greyhound Race and Breeding1.html").read()
soup = BeautifulSoup(html)
table = soup.find("table")

output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

with open('output.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)
This gets more than I want. I only want the part of the table that follows from the td with title="order in which the dogs arrived at the finish". How can I modify my code to do this?
My guess is that table = soup.find("table") should be modified so that it finds
<td title="order in which the dogs arrived at the finish">.
But I don't know how. Maybe I should somehow set table to be the parent of that td. The HTML looks roughly like this:
<table>
  <tr>
    <td>I don't want this</td>
    <td>Or this</td>
  </tr>
</table>
<table>
  <tr>
    <td>I don't want this</td>
    <td>Or this</td>
  </tr>
</table>
<table>
  <tr>
    <td title="order in which the dogs arrived at the finish"> I want this and the rest of the document</td>
    <td> More things I want</td>
  </tr>
</table>
I almost got Jack Fleeting's solution to work:
html = open("Greyhound Race and Breeding1.html").read()
soup = BeautifulSoup(html)
#table = soup.find("table")["title": "order in which the dogs arrived at the finish"]
#table = str(soup.find("table",{"title": "order in which the dogs arrived at the finish"}))
table = soup.find("table")
for table in soup.select('table'):
    if table.select_one('td[title="order in which the dogs arrived at the finish"]') is not None:
        newTable = table

output_rows = []
for table_row in newTable.findAll("tr"):
    columns = table_row.findAll("td")
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

with open("output8.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)
The problem is that it repeats the same row several times, although it does pick the correct table. I tried several times to fix this, but had no luck, so I decided to switch to pandas instead:
from bs4 import BeautifulSoup
import csv
import pandas as pd

# read_html returns a list with one DataFrame per table found in the file
df = pd.read_html("Greyhound Race and Breeding1.html")
# This shows how many tables there are
print(len(df))
# To find the right table, I brute-forced it by printing each df[i] in turn.
# It turns out the table I was looking for was df[8].
print(df[8])
# Finally, write the table to a CSV file
df[8].to_csv("Table.csv")
Upvotes: 0
Views: 61
Reputation: 24930
If I understand you correctly, you can use CSS selectors to do this:
for table in soup.select('table'):
    target = table.select('td[title="order in which the dogs arrived at the finish"]')
    if len(target) > 0:
        print(table)
If you know that only one table meets the requirement, you can use:
target = soup.select_one('td[title="order in which the dogs arrived at the finish"]')
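# find_parent('table') climbs to the enclosing table; with no argument it would return only the <tr>
print(target.find_parent('table'))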
Output:
<table>
<tr>
<td title="order in which the dogs arrived at the finish"> I want this and the rest of the document</td>
<td> More things I want</td>
</tr>
</table>
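To get from the matched table to the CSV file the question is after, something along these lines might work. It is only a sketch: SAMPLE_HTML stands in for the real file, and the helper names extract_rows and rows_to_csv are made up for illustration. It also assumes the titled td sits inside a plain, non-nested table.

```python
import csv
import io

from bs4 import BeautifulSoup

TITLE = "order in which the dogs arrived at the finish"

# Stand-in for the real file, copied from the sample markup in the question.
SAMPLE_HTML = """
<table><tr><td>I don't want this</td><td>Or this</td></tr></table>
<table><tr><td>I don't want this</td><td>Or this</td></tr></table>
<table>
  <tr>
    <td title="order in which the dogs arrived at the finish"> I want this and the rest of the document</td>
    <td> More things I want</td>
  </tr>
</table>
"""


def extract_rows(html, title=TITLE):
    """Return the cell texts, row by row, of the table containing the titled td."""
    soup = BeautifulSoup(html, "html.parser")
    target = soup.find("td", attrs={"title": title})
    if target is None:
        return []  # no td carries that title
    table = target.find_parent("table")
    if table is None:
        return []  # the td is not inside a table
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows without any td (e.g. header-only rows)
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text; each row is written exactly once."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

To write a file instead of a string, pass a handle opened with open("output.csv", "w", newline="") to csv.writer; the csv documentation recommends newline="" so that no extra blank lines appear on Windows.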
Upvotes: 1