kr21

Reputation: 5

Parsing an html table

To start off, here's my current code in its entirety:

import urllib
from BeautifulSoup import BeautifulSoup
import sgmllib
import re

page = 'http://www.sec.gov/Archives/edgar/data/\
8177/000114036111018563/form10k.htm'

sock = urllib.urlopen(page)
raw = sock.read()
soup = BeautifulSoup(raw)

tablelist = soup.findAll('table')

class MyParser(sgmllib.SGMLParser):

    def parse(self, segment):
        self.feed(segment)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.descriptions = []
        self.inside_td_element = 0
        self.starting_description = 0

    def start_td(self, attributes):
        for name, value in attributes:
            if name == "valign":
                self.inside_td_element = 1
                self.starting_description = 1
            else:
                self.inside_td_element = 1
                self.starting_description = 1

    def end_td(self):
        self.inside_td_element = 0

    def handle_data(self, data):
        if self.inside_td_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_descriptions(self):
        return self.descriptions

counter = 0
trlist = []
dtablelist = []

while counter < len(tablelist):
    trsegment = tablelist[counter].findAll('tr')
    trlist.append(trsegment)
    strsegment = str(trsegment)
    myparser = MyParser()
    myparser.parse(strsegment)
    sub = myparser.get_descriptions()
    dtablelist.append(sub)
    counter = counter + 1

ex = []

dtablelist = [s for s in dtablelist if s != ex]

What I want to accomplish is to take all the tables from an HTML document and write them to an Excel spreadsheet. When I create trlist, the output looks like this:

print trlist[1]
[<tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">&#160;</font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Title of each class</font></div>
</td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Name of exchange</font></td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">&#160;</font></td>
</tr>, <tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">&#160;</font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="DISPLAY: inline; FONT-WEIGHT: bold">Common Stock, par value</font>    </font></div>
</td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold">NASDAQ Global Market</font></font></font></font></div>
</div>
</td>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">&#160;</font></td>
</tr>,...

As you can see, each item in trlist is one individual table row (<tr> . . . </tr>), which is what I want. But when I run each trlist item through my sgmllib parser to retrieve the contents between the tags, I get this output:

print dtablelist[1]
['\nTitle of each class\n', 'Name of exchange', '\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n', '\n$1.00 per share\n']

As you can see, each cell's content comes out as its own string in one flat list, instead of being grouped into a list per table row (<tr>). So essentially I want this output:

[['\nTitle of each class\n', 'Name of exchange'], ['\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n'], ['\n$1.00 per share\n']]

Is it because I have to turn trlist into a string before I parse it with MyParser? Does anyone know any way around this, allowing me to parse lists within lists (aka Inception shit)?

Upvotes: 0

Views: 1208

Answers (2)

schmijos

Reputation: 8695

If somebody is looking for a solution to the same problem but is using Python 3:

You don't have to use an external library to parse an HTML table, even on Python 3. There, sgmllib (and with it the SGMLParser class) is gone, and HTMLParser from html.parser takes its place. I've written a simple class derived from HTMLParser. It is here in a github repo. It simply remembers the current scope of a <td>, <tr> or <table> tag. The advantages over using etree are that it runs correctly on non-XML-compliant HTML and that it doesn't use external libraries.
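
As a rough illustration of the scope-tracking idea (this is not the code from the linked repo), a minimal sketch could look like the following. The class name RowGroupingParser and the sample markup are made up for this example, and nested tables are deliberately not handled.

```python
from html.parser import HTMLParser


class RowGroupingParser(HTMLParser):
    """Collect cell text as tables -> rows -> cells (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # result: one list of rows per <table>
        self._row = None   # cells of the currently open <tr>, or None
        self._cell = None  # text chunks of the currently open <td>/<th>, or None

    def _flush_cell(self):
        # Close the open cell (if any) and store its stripped text in the row.
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _flush_row(self):
        # Close the open row (if any) and store it in the latest table.
        self._flush_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._flush_row()
            self.tables.append([])
        elif tag == "tr":
            self._flush_row()
            if self.tables:          # ignore rows outside any table
                self._row = []
        elif tag in ("td", "th"):
            self._flush_cell()
            if self._row is not None:
                self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        # Only text inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)


# Made-up sample input, loosely modelled on the table in the question.
sample = (
    "<table>"
    "<tr><td>Title of each class</td><td>Name of exchange</td></tr>"
    "<tr><td>Common Stock, par value</td><td>NASDAQ Global Market</td></tr>"
    "</table>"
)
parser = RowGroupingParser()
parser.feed(sample)
parser.close()
rows = parser.tables[0]
print(rows)
```

Because a new list is started on every <tr>, the cells end up grouped per row rather than in one flat list, which is the shape the question asks for.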

You can use the class from that repo (named HTMLTableParser there) the following way:

import urllib.request
from html_table_parser import HTMLTableParser

target = 'http://www.twitter.com'

# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')

# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)

The output of this is a list of 2D lists representing tables. It might look like this:

[[['   ', ' Anmelden ']],
 [['Land', 'Code', 'Für Kunden von'],
  ['Vereinigte Staaten', '40404', '(beliebig)'],
  ['Kanada', '21212', '(beliebig)'],
  ...
  ['3424486444', 'Vodafone'],
  ['  Zeige SMS-Kurzwahlen für andere Länder ']]]
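
Since the question is about getting the tables into a spreadsheet, one possible next step is to turn each 2D list into CSV, which Excel can open. This is only a sketch using the standard csv module; the helper name table_to_csv and the sample data are assumptions for illustration, not part of the parser class above.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize one table (a list of rows, each a list of cell strings) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up stand-in for p.tables: a list holding one small table.
tables = [
    [["Land", "Code"], ["Kanada", "21212"]],
]
csv_text = table_to_csv(tables[0])
print(csv_text)
```

To write a real file instead of a string, the same csv.writer can be pointed at a file opened with open(path, "w", newline=""), one file per table.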

Upvotes: 1

odie5533

Reputation: 562

Using lxml.html:

>>> import lxml.html
>>> data = ["<tr><td>test</td><td>help</td></tr>", "<tr><td>data1</td><td>data2</td></tr>"]
>>> [lxml.html.fromstring(tr).xpath(".//text()") for tr in data]
[['test', 'help'], ['data1', 'data2']]

And here is some more complete code. It stores the text in a list of tables; each table is a list of tr's, and each tr is a list of all the text it contains.

import urllib
import lxml.html

data = urllib.urlopen('http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm').read()
tree = lxml.html.fromstring(data)

tables = []
for tbl in tree.iterfind('.//table'):
    tele = []
    tables.append(tele)
    for tr in tbl.iterfind('.//tr'):
        text = [e.strip() for e in tr.xpath('.//text()') if len(e.strip()) > 0]
        tele.append(text)

print tables

Hope this helps, cheers!

Upvotes: 2
