Parsing an HTML table

Question

To start off, here's my current code in its entirety:

import urllib
from BeautifulSoup import BeautifulSoup
import sgmllib
import re

page = 'http://www.sec.gov/Archives/edgar/data/\
8177/000114036111018563/form10k.htm'

sock = urllib.urlopen(page)
raw = sock.read()
soup = BeautifulSoup(raw)

tablelist = soup.findAll('table')

class MyParser(sgmllib.SGMLParser):

    def parse(self, segment):
        self.feed(segment)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.descriptions = []
        self.inside_td_element = 0
        self.starting_description = 0

    def start_td(self, attributes):
        for name, value in attributes:
            if name == "valign":
                self.inside_td_element = 1
                self.starting_description = 1
            else:
                self.inside_td_element = 1
                self.starting_description = 1

    def end_td(self):
        self.inside_td_element = 0

    def handle_data(self, data):
        if self.inside_td_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_descriptions(self):
        return self.descriptions

counter = 0
trlist = []
dtablelist = []

while counter < len(tablelist):
    trsegment = tablelist[counter].findAll('tr')
    trlist.append(trsegment)
    strsegment = str(trsegment)
    myparser = MyParser()
    myparser.parse(strsegment)
    sub = myparser.get_descriptions()
    dtablelist.append(sub)
    counter = counter + 1

ex = []

dtablelist = [s for s in dtablelist if s != ex]

What I want to accomplish is to take all the tables from an HTML document and reprint them onto an Excel spreadsheet. When I create trlist, the output looks like this:

print trlist[1]
[
 

Title of each class

Name of exchange
 
, 
 

Common Stock, par value    



NASDAQ Global Market


 
,...

As you can see, each item in trlist is an individual row (<tr>...</tr>) of the table, which is what I want. But when I run each trlist item through my sgmllib parser to retrieve the contents between the tags, I get this output:

print dtablelist[1]
['
Title of each class
', 'Name of exchange', '
Common Stock, par value
', '

NASDAQ Global Market

', '
$1.00 per share
']

As you can see, each cell's contents come out as an individual string in one flat list, instead of one list per table row (<tr>...</tr>). So essentially I want this output:

[['
Title of each class
', 'Name of exchange'], ['
Common Stock, par value
', '

NASDAQ Global Market

'], ['
$1.00 per share
']]

Is it because I have to turn trlist into a string before I parse it with MyParser? Does anyone know any way around this, allowing me to parse lists within lists (aka Inception shit)?

odie5533 · Accepted Answer

Using lxml.html:

>>> import lxml.html
>>> data = ["<tr><td>test</td><td>help</td></tr>", "<tr><td>data1</td><td>data2</td></tr>"]
>>> [lxml.html.fromstring(tr).xpath(".//text()") for tr in data]
[['test', 'help'], ['data1', 'data2']]
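
On the "do I have to turn trlist into a string" part: as far as I can tell the flat list is not caused by the str() call. MyParser only reacts to td tags, so it never starts a new sublist when a new row begins. Below is a minimal sketch of per-row grouping using only the standard library. It assumes Python 3, where sgmllib no longer exists and html.parser is the closest stand-in; the class name RowGroupingParser and the sample HTML are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class RowGroupingParser(HTMLParser):
    """Collect cell text as one list per <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one sublist per table row
        self.in_cell = False  # True while inside <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new row starts a new sublist
        elif tag in ("td", "th") and self.rows:
            self.in_cell = True
            self.rows[-1].append("")  # a new cell starts a new string

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            # Text may arrive in several chunks, so append to the open cell.
            self.rows[-1][-1] += data


# Made-up sample markup, loosely based on the table in the question.
sample = (
    "<table>"
    "<tr><td>Title of each class</td><td>Name of exchange</td></tr>"
    "<tr><td>Common Stock, par value</td><td>NASDAQ Global Market</td></tr>"
    "</table>"
)

parser = RowGroupingParser()
parser.feed(sample)
parser.close()
rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)
```

The only real difference from the parser in the question is the tr branch in handle_starttag: every row opens a fresh list, and cells are appended to the last one.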

And here is some more complete code. It builds a list of tables, where each table is a list of rows (tr elements) and each row is a list of the non-empty text strings found in that row.

import urllib
import lxml.html

data = urllib.urlopen('http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm').read()
tree = lxml.html.fromstring(data)

tables = []
for tbl in tree.iterfind('.//table'):
    tele = []
    tables.append(tele)
    for tr in tbl.iterfind('.//tr'):
        text = [e.strip() for e in tr.xpath('.//text()') if len(e.strip()) > 0]
        tele.append(text)

print tables
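
For the spreadsheet step from the question, one option is to dump the nested list to CSV, which Excel can open. This is only a sketch under a few assumptions: it is written for Python 3 (unlike the Python 2 code above), the tables value is a small made-up stand-in with the same tables -> rows -> cells shape, and write_tables is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the nested list: tables -> rows -> cell text (sample data).
tables = [
    [["Title of each class", "Name of exchange"],
     ["Common Stock, par value", "NASDAQ Global Market"]],
    [["$1.00 per share"]],
]


def write_tables(tables, out):
    """Write every row of every table, with a blank row between tables."""
    writer = csv.writer(out)
    for index, table in enumerate(tables):
        if index > 0:
            writer.writerow([])  # blank separator row between tables
        writer.writerows(table)


# Write to an in-memory buffer so the sketch has no side effects.
buffer = io.StringIO()
write_tables(tables, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, pass a file object opened with newline="" (for example open("tables.csv", "w", newline=""), where the file name is just a placeholder). Cells that contain commas are quoted by the csv module, so they stay in one column.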

Hope this helps, cheers!
