Reputation: 655
Here is my code snippet:
from HTMLParser import HTMLParser
# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_endtag(self, tag):
        if(tag == 'tr'):
            textFile.write('\n')
    def handle_data(self, data):
        textFile.write(data+"\t")
textFile = open('instaQueryResult', 'w+')
# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
fh = open('/data/aman/aggregate.html','r')
l = fh.readlines()
for line in l:
    parser.feed(line)
I parse an HTML file and get the following output:
plantype count(distinct(SubscriberId)) sum(DownBytesNONE) sum(UpBytesNONE) sum(SessionCountNONE)
1006657 341175 36435436130 36472526498 694016
1013287 342280 36694005846 36533489363 697098
1006613 343867 36763692173 36755893252 699976
1014883 342436 36575951812 36572503611 695683
1003022 343238 36705838418 36637429353 698618
plantype count(distinct(SubscriberId)) sum(DownBytesNONE) sum(UpBytesNONE) sum(SessionCountNONE)
1013287 342280 36694005846 36533489363 697098
1006657 341175 36435436130 36472526498 694016
1006613 343867 36763692173 36755893252 699976
1014883 342436 36575951812 36572503611 695683
1003022 343238 36705838418 36637429353 698618
This output is correct, but I want the headers removed. Every line containing the headers should be dropped from the file, leaving just the values.
Expected Output:
1006657 341175 36435436130 36472526498 694016
1013287 342280 36694005846 36533489363 697098
1006613 343867 36763692173 36755893252 699976
1014883 342436 36575951812 36572503611 695683
1003022 343238 36705838418 36637429353 698618
1013287 342280 36694005846 36533489363 697098
1006657 341175 36435436130 36472526498 694016
1006613 343867 36763692173 36755893252 699976
1014883 342436 36575951812 36572503611 695683
1003022 343238 36705838418 36637429353 698618
Upvotes: 0
Views: 73
Reputation: 7303
I assume that your HTML data has the following form:
<table>
    <tr>
        <td>plantype</td>
        <td>count(distinct(SubscriberId))</td>
        ...
    </tr>
    <tr>
        <td>1006657</td>
        <td>341175</td>
        ...
    </tr>
</table>
You could use a row_count variable to check whether you are in the first tr tag. Reset row_count to 0 in handle_starttag whenever a table starts, check it in handle_data and handle_endtag, and increment it in handle_endtag each time a row ends:
class MyHTMLParser(HTMLParser):
    row_count = 0
    def handle_starttag(self, tag, attrs):
        # a new table starts: the next row is its header
        if tag == 'table':
            self.row_count = 0
    def handle_endtag(self, tag):
        if tag == 'tr':
            # only finished data rows get a newline; the header row is skipped
            if self.row_count > 0:
                textFile.write('\n')
            self.row_count += 1
    def handle_data(self, data):
        # row_count is 0 while inside the header row
        if self.row_count > 0:
            textFile.write(data+"\t")
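To try the row-counting idea without touching any files, here is a minimal self-contained sketch. It assumes Python 3 (where the module is html.parser rather than HTMLParser) and assumes the header repeats because the file holds more than one table. The class name HeaderSkippingParser, the rows list and the sample string are made up for illustration. Only the cell values come from the question.

```python
from html.parser import HTMLParser  # Python 2 spelling: from HTMLParser import HTMLParser


class HeaderSkippingParser(HTMLParser):
    """Collects table cells row by row, dropping the first row of every table."""

    def __init__(self):
        super().__init__()
        self.row_count = 0   # rows already finished in the current table
        self.rows = []       # data rows collected so far
        self.current = []    # cells of the row being read

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.row_count = 0   # assumption: each table begins with a header row
        elif tag == 'tr':
            self.current = []

    def handle_data(self, data):
        text = data.strip()
        if text and self.row_count > 0:   # ignore whitespace and header cells
            self.current.append(text)

    def handle_endtag(self, tag):
        if tag == 'tr':
            if self.row_count > 0:
                self.rows.append(self.current)
            self.row_count += 1


# Hypothetical sample input: two tables, each with its own header row.
sample = """
<table>
<tr><td>plantype</td><td>count(distinct(SubscriberId))</td></tr>
<tr><td>1006657</td><td>341175</td></tr>
</table>
<table>
<tr><td>plantype</td><td>count(distinct(SubscriberId))</td></tr>
<tr><td>1013287</td><td>342280</td></tr>
</table>
"""

parser = HeaderSkippingParser()
parser.feed(sample)
parser.close()
lines = ["\t".join(row) for row in parser.rows]
print("\n".join(lines))
```

Collecting rows in a list and joining them at the end also avoids the trailing tab that writing data+"\t" for every cell leaves on each line.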
Upvotes: 0
Reputation: 12092
Since you are trying to get rid of anything that does not consist of numbers, you could try modifying your handle_data(self, data) method as follows:
def handle_data(self, data):
    if data.isdigit():
        textFile.write(data+"\t")
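It may help to see what str.isdigit actually lets through. The sketch below assumes Python 3 and uses a hypothetical DigitsOnlyParser class. The strip() call is an addition that is not in the method above, and the cells -5 and 3.14 are invented purely to show the limits of the check.

```python
from html.parser import HTMLParser  # Python 2 spelling: from HTMLParser import HTMLParser


class DigitsOnlyParser(HTMLParser):
    """Keeps only the cells that consist entirely of digits."""

    def __init__(self):
        super().__init__()
        self.cells = []

    def handle_data(self, data):
        text = data.strip()      # added here so padded cells still count
        if text.isdigit():       # False for empty strings, signs and decimal points
            self.cells.append(text)


p = DigitsOnlyParser()
# Header cells from the question, one real value, and two invented values.
p.feed("<tr><td>plantype</td><td>sum(UpBytesNONE)</td></tr>"
       "<tr><td>1006657</td><td>-5</td><td>3.14</td></tr>")
p.close()
print(p.cells)
```

The header cells are dropped, but so are the negative and decimal values, so this filter only fits data made up entirely of non-negative integers, as in the question.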
Upvotes: 1
Reputation: 7845
Try this:
fh = open('/data/aman/aggregate.html','r')
l = fh.readlines()
for line in l:
    if 'plantype' not in line:
        parser.feed(line)
You're reading the file line by line. The check 'plantype' not in line skips every line that contains the header text, so only the remaining lines (the ones you want) are fed to the parser.
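One thing to check before relying on the line filter is how the header row is laid out in the file. The sketch below assumes Python 3. CellCollector, parse_skipping and both sample inputs are hypothetical. It suggests that the filter removes the whole header only when all header cells sit on the same line as plantype.

```python
from html.parser import HTMLParser  # Python 2 spelling: from HTMLParser import HTMLParser


class CellCollector(HTMLParser):
    """Collects every non-empty piece of text it is fed."""

    def __init__(self):
        super().__init__()
        self.cells = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.cells.append(text)


def parse_skipping(lines, marker="plantype"):
    """Feed only the lines that do not contain the marker text."""
    parser = CellCollector()
    for line in lines:
        if marker not in line:
            parser.feed(line)
    parser.close()
    return parser.cells


# Header row on a single line: the filter drops it completely.
one_line_header = [
    "<table>\n",
    "<tr><td>plantype</td><td>sum(UpBytesNONE)</td></tr>\n",
    "<tr><td>1006657</td><td>36472526498</td></tr>\n",
    "</table>\n",
]
# Header cells on separate lines: only the 'plantype' line is dropped.
multi_line_header = [
    "<table>\n",
    "<tr>\n",
    "<td>plantype</td>\n",
    "<td>sum(UpBytesNONE)</td>\n",
    "</tr>\n",
    "</table>\n",
]
print(parse_skipping(one_line_header))
print(parse_skipping(multi_line_header))
```

If the file looks like the second sample, the other header cells still reach the parser, and a row-based check inside the parser is the safer choice.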
Upvotes: 0