Reputation: 23
So I'm trying to scrape data from the table on the Michigan Department of Health and Human Services website using BeautifulSoup 4.0 and I don't know how to format it properly.
I have the code below written to get the and information from the website but I'm at a loss as how to format it so that it has the same appearance as the table on the website when I print it or save it as a .txt/ .csv file. I've looked around here and on a bunch of other websites for an answer but I'm not sure how to go forward with this. I'm very much a beginner so any help would be appreciated.
My code just prints a long list of either the table rows or table data:
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
for tr in rows:
tds = tr.find_all('td')
print tds
The HTML that I'm looking at is below as well:
<table border=0 cellpadding=3 cellspacing=0 width=640 align="center">
<thead style="display: table-header-group;">
<tr height=18 align="center">
<th height=35 align="left" colspan="2">County</th>
<th height="35" align="right">
2005
</th>
that part shows the years as headers and goes until 2015 and then the state and county data is further down:
<tr height="40" >
<th class="LeftAligned" colspan="2">Michigan</th>
<td>
127,518
</td>
and so on for the rest of the counties. Again, any help is greatly appreciated.
Upvotes: 2
Views: 5062
Reputation: 15376
You need to store your table in a list
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
table_contents = [] # store your table here
for tr in rows:
if rows.index(tr) == 0 :
row_cells = [ th.getText().strip() for th in tr.find_all('th') if th.getText().strip() != '' ]
else :
row_cells = ([ tr.find('th').getText() ] if tr.find('th') else [] ) + [ td.getText().strip() for td in tr.find_all('td') if td.getText().strip() != '' ]
if len(row_cells) > 1 :
table_contents += [ row_cells ]
Now table_contents
has the same structure and data as the table on the page.
Upvotes: 2