icypy

Reputation: 3202

Parsing a convoluted HTML table with BeautifulSoup

I am trying to create a CSV file from the NOAA observation history at http://www.srh.noaa.gov/data/obhistory/PAFA.html.

I tried working with the table tag, but it failed, so I am trying to do it by identifying each <tr> row instead. This is my code:

# This script should take the table content from the URL and save the data
# into a CSV file.
import urllib2
from bs4 import BeautifulSoup

noaa = urllib2.urlopen("http://www.srh.noaa.gov/data/obhistory/PAFA.html").read()
soup = BeautifulSoup(noaa)

# Iterate over rows 7 to 77 and print the text of each of the 15 cells,
# space-delimited.
for i in range(7, 78):
    row = soup.findAll('tr')[i]
    for n in range(0, 15):
        cell = row.findAll('td')[n]
        print cell.find(text=True),
    print

At the moment some of it works as expected and some does not, and I am not sure how to stitch the last part together the way I would like.

OK, so I took what That1Guy suggested and tried to extend it to the CSV component:

import urllib2 as urllib
from bs4 import BeautifulSoup
from time import localtime, strftime
import csv

url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)

table = soup('table')[3]
table_rows = table.findAll('tr')
row_count = 0
for table_row in table_rows:
    row_count += 1
    if row_count < 4:
        continue

    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]

    print date, time, wind
    with open("/home/eyalak/Documents/weather/weather.csv", "wb") as f:
        writer = csv.writer(f)
        writer.writerow( ('Title 1', 'Title 2', 'Title 3') )
        writer.writerow(str(time)+str(wind)+str(date)+'\n')
    if row_count == 74:
        print "74"
        break

The printed result is fine; it is the file that is not. I get:

Title   1,Title 2,Title 3
0,5,:,5,3,C,a,l,m,0,8,"

The problems in the CSV file created are:

  1. The title row is broken into the wrong columns: column 2 contains "1,Title" instead of "Title 2".
  2. The data is comma-delimited in the wrong places.
  3. As the script writes new lines, it overwrites the previous ones instead of appending at the bottom.

Any thoughts?
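For what it's worth, symptoms 2 and 3 both come from the writing loop: `csv.writer.writerow` expects a sequence (a plain string is iterated character by character, so each character becomes a field), and reopening the file with "wb" on every row truncates what was written before. A minimal sketch of the string-vs-sequence behaviour, using a small in-memory buffer instead of a real file:

```python
import csv

class Buf(object):
    """Minimal file-like object that records each row csv.writer emits."""
    def __init__(self):
        self.rows = []
    def write(self, s):
        self.rows.append(s)

buf = Buf()
writer = csv.writer(buf)

# Passing a sequence gives one field per element -- the intended layout:
writer.writerow(['Date', 'Time', 'Wind'])

# Passing a single concatenated string iterates it character by character,
# which reproduces the "0,5,:,5,3,C,a,l,m" symptom:
writer.writerow('05:53Calm')

# The fix is to pass the values as a sequence instead:
writer.writerow(['05:53', 'Calm', '08'])
```

For the file itself, the same reasoning suggests opening it once before the loop (writing the header once there) and calling `writerow` with a list inside the loop; the path and column values here are just placeholders.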

Upvotes: 1

Views: 284

Answers (1)

That1Guy

Reputation: 7233

This worked for me:

import urllib
from bs4 import BeautifulSoup

url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)

table = soup('table')[3]
table_rows = table.findAll('tr')
row_count = 0
for table_row in table_rows:
    row_count += 1
    if row_count < 4:
        continue

    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]

    print date, time, wind

    if row_count == 74:
        break

This code obviously only returns the first 3 cells of each row, but you get the idea. Also, note that some cells are empty. To make sure they're populated (and otherwise avoid an IndexError), I would check the length of each cell before grabbing .contents, i.e.:

if len(table_row('td')[offset]) > 0:
    variable = table_row('td')[offset].contents[0]

This will ensure the cell is populated, and you will avoid an IndexError.
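The same guarded-lookup pattern can be sketched with plain lists standing in for the parsed `<td>` tags (the rows, the `cell` helper, and the `'M'` placeholder below are all hypothetical, just to show the shape of the check):

```python
# Rows as the scraper might see them: some short, some with empty cells.
rows = [
    ['03', '05:53', 'Calm'],
    ['03', '04:53'],         # missing wind cell
    ['03', '', 'NW 5'],      # empty time cell
]

def cell(row, offset, default='M'):
    # Guarded lookup: return a default instead of raising IndexError on a
    # short row, or returning an empty value for an empty cell.
    if offset < len(row) and row[offset]:
        return row[offset]
    return default

extracted = [(cell(r, 0), cell(r, 1), cell(r, 2)) for r in rows]
```

With real BeautifulSoup tags, the same check would test `len(table_row('td'))` against the offset and the cell's length before touching `.contents[0]`.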

Upvotes: 2
