Extracting and Printing Table Headers and Data with Beautiful Soup with Python 2.7

Question

So I'm trying to scrape data from the table on the Michigan Department of Health and Human Services website using BeautifulSoup 4.0 and I don't know how to format it properly.

I have written the code below to get the table row (tr) and table data (td) information from the website, but I'm at a loss as to how to format it so that it has the same appearance as the table on the website when I print it or save it as a .txt or .csv file. I've looked around here and on a bunch of other websites for an answer, but I'm not sure how to go forward with this. I'm very much a beginner, so any help would be appreciated.

My code just prints a long list of either the table rows or table data:

import urllib2
import bs4
from bs4 import BeautifulSoup

url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")

table = soup.find("table")
rows = table.find_all("tr")

for tr in rows:
    tds = tr.find_all('td')
    print tds

The HTML that I'm looking at is below as well:

     County
     2005

That part shows the years as headers and goes until 2015, and then the state and county data is further down:

     Michigan
     127,518

and so on for the rest of the counties.
Again, any help is greatly appreciated.

t.m.adam · Accepted Answer

You need to store your table in a list of rows, where each row is itself a list of cell strings:

import urllib2
from bs4 import BeautifulSoup

url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")

table = soup.find("table")
rows = table.find_all("tr")

table_contents = []   # store your table here, one list of cell strings per row
for i, tr in enumerate(rows):
    if i == 0:
        # first row: the column titles are in <th> cells
        row_cells = [th.getText().strip() for th in tr.find_all('th') if th.getText().strip() != '']
    else:
        # other rows: the row label (if any) is in a <th>, the values are in <td> cells
        label = tr.find('th')
        row_cells = ([label.getText().strip()] if label else []) + [td.getText().strip() for td in tr.find_all('td') if td.getText().strip() != '']
    if len(row_cells) > 1:
        # skip empty or spacer rows
        table_contents.append(row_cells)

Now table_contents has the same structure and data as the table on the page.
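
To get from that list to the printed or saved table the question asks about, something along these lines might work. This is only a sketch: format_table and write_csv are hypothetical helper names, not part of Beautiful Soup, and the small table_contents below is a stand-in for the scraped data (the "Example County" row is a made-up placeholder, not real data). It assumes every cell is a plain ASCII string, and it is written so that it runs on Python 2.7 as well as Python 3.

```python
import csv
import sys

# Stand-in for the scraped table_contents; the last row is a made-up placeholder
table_contents = [
    ["County", "2005"],
    ["Michigan", "127,518"],
    ["Example County", "1,234"],
]

def format_table(table):
    """Return the table as text with space-padded, aligned columns."""
    n_cols = max(len(row) for row in table)
    # Pad short rows so every row has the same number of cells
    padded = [row + [""] * (n_cols - len(row)) for row in table]
    # Width of each column is the length of its longest cell
    widths = [max(len(row[i]) for row in padded) for i in range(n_cols)]
    lines = []
    for row in padded:
        # Names in the first column are left-aligned, the numbers right-aligned
        cells = [row[0].ljust(widths[0])]
        cells += [cell.rjust(w) for cell, w in zip(row[1:], widths[1:])]
        lines.append("  ".join(cells))
    return "\n".join(lines)

def write_csv(path, table):
    """Write the table to a .csv file, one table row per line."""
    # The csv module wants a binary file on Python 2 and a text file
    # opened with newline="" on Python 3
    if sys.version_info[0] < 3:
        f = open(path, "wb")
    else:
        f = open(path, "w", newline="")
    try:
        csv.writer(f).writerows(table)
    finally:
        f.close()

print(format_table(table_contents))
```

Calling write_csv("births.csv", table_contents) (the file name is arbitrary) would save the same rows as a CSV file. The csv module puts quotes around values such as 127,518, so the thousands separator is not mistaken for a column separator. For a .txt file, writing the string returned by format_table to a file should be enough.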
