Reputation: 43
I'm running into some problems using BS4 (BeautifulSoup 4) to extract specific elements. The data comes from the Texas Department of Criminal Justice (TDCJ) Executed Inmates page.
I've attached a screenshot for better understanding.
Within each tr tag, there are multiple td tags containing text about First Name, Last Name, TDCJ Number, Age, Date, etc.
How can I get BS4 to skip the first tr tag (which contains the column names) and, for each subsequent tr tag, extract the text from its td tags?
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

def main():
    gettabledata()

lstofinmates = list()

def gettabledata():
    with urlopen('https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html') as response:
        soup = BeautifulSoup(response, 'html.parser')
        with open('exinmates.csv', 'w', newline='') as output_file:
            inmate_file_writer = csv.DictWriter(output_file,
                fieldnames=['First Name', 'Last Name', 'Execution Number',
                            'Last Statement', 'TDCJ Number', 'Age', 'Date Executed', 'Race',
                            'County'],
                extrasaction='ignore',
                delimiter=',', quotechar='"')
            inmate_file_writer.writeheader()
            table = soup.find('table').find('tbody')
            print(table)

if __name__ == '__main__':
    main()
I'm thinking of creating a list-of-dictionaries (LOD) structure, where each dictionary holds one inmate's information: the text from the td fields goes into a dictionary, and each dictionary is appended to a list. The problem is that I can't find a way to skip the first tr tag, or to iterate over the remaining tr tags and add their contents to a dictionary. Any suggestions/help? Thanks!
Upvotes: 1
Reputation: 387745
Here is something to get you started:
from bs4 import BeautifulSoup
html = '''<h1>Executed Offenders</h1>
<table class="os" width="100%">
<tbody>
<tr><th scope="col">Execution</th><th scope="col">Link</th><th scope="col">Link</th><th scope="col">Last Name</th><th scope="col">First Name</th><th scope="col">TDCJ Number</th><th scope="col">Age</th><th scope="col">Date</th><th scope="col">Race</th><th scope="col">County</th></tr>
<tr><td>542</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Bigby</td><td>James</td><td>997</td><td>61</td><td>3/14/2017</td><td>White</td><td>Tarrant</td></tr>
<tr><td>541</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Ruiz</td><td>Rolando</td><td>999145</td><td>44</td><td>3/07/2017</td><td>Hispanic</td><td>Bexar</td></tr>
<tr><td>540</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Edwards</td><td>Terry</td><td>999463</td><td>43</td><td>1/26/2017</td><td>Black</td><td>Dallas</td></tr>
<tr><td>539</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Wilkins</td><td>Christopher</td><td>999533</td><td>48</td><td>01/11/2017</td><td>White</td><td>Tarrant</td></tr>
<tr><td>538</td><td><a href="#">Offender Information</a></td><td><a href="#">Last Statement</a></td><td>Fuller</td><td>Barney</td><td>999481</td><td>58</td><td>10/05/2016</td><td>White</td><td>Houston</td></tr>
</tbody>
</table>'''
soup = BeautifulSoup(html, 'html.parser')
rows = iter(soup.find('table').find_all('tr'))
# skip first row
next(rows)
for row in rows:
    for cell in row.find_all('td'):
        print(cell)
    print()
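To get from printing cells to the list of dictionaries described in the question, one option is to use the header row for the keys instead of discarding it. The following is only a sketch under a few assumptions: the function name `table_to_dicts` and the shortened `sample` table are made up for illustration, and the first tr is assumed to hold th cells. Note that on the real page two columns are both headed "Link", so they would collapse into a single key with this approach.

```python
from bs4 import BeautifulSoup


def table_to_dicts(html):
    """Return one dict per data row, keyed by the header row's column names.

    Hypothetical helper: assumes the first <tr> of the first table holds
    the <th> column names and every later <tr> holds <td> data cells.
    """
    soup = BeautifulSoup(html, 'html.parser')
    rows = iter(soup.find('table').find_all('tr'))

    # Consume the first row and keep its text as the dictionary keys.
    header = [th.get_text(strip=True) for th in next(rows).find_all('th')]

    inmates = []
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:  # ignore any row that has no <td> cells
            inmates.append(dict(zip(header, cells)))
    return inmates


# Shortened, illustrative table (fewer columns than the real page).
sample = '''<table>
<tr><th>Execution</th><th>Last Name</th><th>First Name</th><th>Age</th></tr>
<tr><td>542</td><td>Bigby</td><td>James</td><td>61</td></tr>
<tr><td>541</td><td>Ruiz</td><td>Rolando</td><td>44</td></tr>
</table>'''

inmates = table_to_dicts(sample)
print(inmates[0])
```

The resulting list could then be handed to `csv.DictWriter.writerows()`. With `extrasaction='ignore'`, keys that are not in `fieldnames` are dropped, but the page's headers ("Execution", "Date") do not match the field names in the question ("Execution Number", "Date Executed"), so the keys would probably need renaming first.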
Upvotes: 3