Reputation: 5
To start off, here's my current code in its entirety:
import urllib
from BeautifulSoup import BeautifulSoup
import sgmllib
import re
page = 'http://www.sec.gov/Archives/edgar/data/\
8177/000114036111018563/form10k.htm'
sock = urllib.urlopen(page)
raw = sock.read()
soup = BeautifulSoup(raw)
tablelist = soup.findAll('table')
class MyParser(sgmllib.SGMLParser):
    def parse(self, segment):
        self.feed(segment)
        self.close()

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.descriptions = []
        self.inside_td_element = 0
        self.starting_description = 0

    def start_td(self, attributes):
        for name, value in attributes:
            if name == "valign":
                self.inside_td_element = 1
                self.starting_description = 1
            else:
                self.inside_td_element = 1
                self.starting_description = 1

    def end_td(self):
        self.inside_td_element = 0

    def handle_data(self, data):
        if self.inside_td_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

    def get_descriptions(self):
        return self.descriptions
counter = 0
trlist = []
dtablelist = []
while counter < len(tablelist):
    trsegment = tablelist[counter].findAll('tr')
    trlist.append(trsegment)
    strsegment = str(trsegment)
    myparser = MyParser()
    myparser.parse(strsegment)
    sub = myparser.get_descriptions()
    dtablelist.append(sub)
    counter = counter + 1

ex = []
dtablelist = [s for s in dtablelist if s != ex]
What I want to accomplish is to take all the tables from an HTML document and then reprint them onto an Excel spreadsheet. When I create trlist, the output looks like this:
print trlist[1]
[<tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT- SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Title of each class</font></div>
</td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline">Name of exchange</font></td>
<td valign="top" width="25%" style="TEXT-ALIGN: center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman; TEXT-DECORATION: underline"> </font></td>
</tr>, <tr>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="DISPLAY: inline; FONT-WEIGHT: bold">Common Stock, par value</font> </font></div>
</td>
<td valign="top" width="25%">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center">
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"><font style="FONT-WEIGHT: bold"><font style="FONT-WEIGHT: bold">< <font style="FONT-WEIGHT: bold">NASDAQ Global Market</font></font></font></font></div>
</div>
</td>
<td valign="top" width="25%"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: times new roman"> </font></td>
</tr>,...
As you can see, each item in trlist is an individual row (<tr>...</tr>) of the table, which is what I want. But when I run each trlist item through my sgmllib parser to retrieve the contents between the <td> tags, I get this output:
print dtablelist[1]
['\nTitle of each class\n', 'Name of exchange', '\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n', '\n$1.00 per share\n']
As you can see, the output is one flat list, with each cell's contents as its own string, instead of a list of the contents of each table row (<tr>). So essentially I want the output:
[['\nTitle of each class\n', 'Name of exchange'], ['\nCommon Stock, par value\n', '\n\nNASDAQ Global Market\n\n'], ['\n$1.00 per share\n']]
Is it because I have to turn trlist into a string before I parse it with MyParser? Does anyone know a way around this that lets me parse lists within lists (Inception-style)?
Upvotes: 0
Views: 1208
Reputation: 8695
If somebody is searching for a solution to the same problem but is using Python 3: you don't have to use an external library to parse an HTML table, even in Python 3. There, the SGMLParser class was replaced by HTMLParser from html.parser. I've written code for a simple derived HTMLParser class; it is available in a GitHub repo. It simply remembers the current scope of a <td>, <tr>, or <table> tag. The advantages over using etree are that it runs correctly on non-XML-compliant HTML and that it doesn't use external libraries.
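For reference, here is a minimal sketch of what such a derived class could look like (the names and structure are illustrative only, not the actual code from the repo):

from html.parser import HTMLParser

class SimpleTableParser(HTMLParser):
    # Sketch only: collects the text of every <td>/<th> cell,
    # grouped into rows and tables.
    def __init__(self):
        super().__init__()
        self.tables = []    # list of tables; each table is a list of rows
        self._row = None    # current row being built, or None outside <tr>
        self._cell = None   # current cell text being built, or None outside <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append([])
        elif tag == 'tr' and self.tables:
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = ''

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._row is not None:
            self._row.append(self._cell.strip() if self._cell else '')
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None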
You can use that class (here named HTMLTableParser) in the following way:
import urllib.request
from html_table_parser import HTMLTableParser
target = 'http://www.twitter.com'
# get website content
req = urllib.request.Request(url=target)
f = urllib.request.urlopen(req)
xhtml = f.read().decode('utf-8')
# instantiate the parser and feed it
p = HTMLTableParser()
p.feed(xhtml)
print(p.tables)
The output is a list of 2D lists representing the tables. It might look something like this:
[[[' ', ' Anmelden ']],
[['Land', 'Code', 'Für Kunden von'],
['Vereinigte Staaten', '40404', '(beliebig)'],
['Kanada', '21212', '(beliebig)'],
...
['3424486444', 'Vodafone'],
[' Zeige SMS-Kurzwahlen für andere Länder ']]]
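Applied to the 10-K filing from the question, the same approach should give the per-row nesting that was asked for. A sketch, assuming the same HTMLTableParser class:

import urllib.request
from html_table_parser import HTMLTableParser

url = 'http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm'
# The filing is assumed here to decode as UTF-8/ASCII.
xhtml = urllib.request.urlopen(url).read().decode('utf-8')

p = HTMLTableParser()
p.feed(xhtml)

# p.tables[1] is the second table as a list of rows,
# each row a list of cell strings.
print(p.tables[1])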
Upvotes: 1
Reputation: 562
Using lxml.html:
>>> import lxml.html
>>> data = ["<tr><td>test</td><td>help</td></tr>", "<tr><td>data1</td><td>data2</td></tr>"]
>>> [lxml.html.fromstring(tr).xpath(".//text()") for tr in data]
[['test', 'help'], ['data1', 'data2']]
Here is some more complete code. It stores the text in a list of tables; each table is a list of rows (tr elements), and each row is a list of that row's text content.
import urllib
import lxml.html

data = urllib.urlopen('http://www.sec.gov/Archives/edgar/data/8177/000114036111018563/form10k.htm').read()
tree = lxml.html.fromstring(data)

tables = []
for tbl in tree.iterfind('.//table'):
    tele = []
    tables.append(tele)
    for tr in tbl.iterfind('.//tr'):
        text = [e.strip() for e in tr.xpath('.//text()') if len(e.strip()) > 0]
        tele.append(text)

print tables
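If the end goal is still to get these tables into Excel, one way from here is to write each table out as a CSV file, which Excel opens directly. A sketch, assuming the tables list built above (the filenames are illustrative):

import csv

# Sketch: write each extracted table to its own CSV file.
# Non-ASCII cell text may need explicit encoding under Python 2.
for i, tbl in enumerate(tables):
    with open('table_%d.csv' % i, 'wb') as f:  # 'wb' is what Python 2's csv module expects
        writer = csv.writer(f)
        for row in tbl:
            writer.writerow(row)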
Hope this helps, cheers!
Upvotes: 2