sam

Reputation: 655

Python remove elements from a file

Here is my code snippet:

from HTMLParser import HTMLParser
# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
        def handle_endtag(self, tag):
                if(tag == 'tr'):
                    textFile.write('\n')
        def handle_data(self, data):
                textFile.write(data+"\t")

textFile = open('instaQueryResult', 'w+')

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
fh = open('/data/aman/aggregate.html','r')
l = fh.readlines()
for line in l:
        parser.feed(line)

I parse an HTML file and get the following output:

plantype        count(distinct(SubscriberId))   sum(DownBytesNONE)      sum(UpBytesNONE)            sum(SessionCountNONE)
1006657 341175  36435436130     36472526498     694016
1013287 342280  36694005846     36533489363     697098
1006613 343867  36763692173     36755893252     699976
1014883 342436  36575951812     36572503611     695683
1003022 343238  36705838418     36637429353     698618
plantype        count(distinct(SubscriberId))   sum(DownBytesNONE)      sum(UpBytesNONE)            sum(SessionCountNONE)
1013287 342280  36694005846     36533489363     697098
1006657 341175  36435436130     36472526498     694016
1006613 343867  36763692173     36755893252     699976
1014883 342436  36575951812     36572503611     695683
1003022 343238  36705838418     36637429353     698618

This output is correct, but I want the headers removed. Every line containing the headers should be dropped from the file, leaving just the values.

Expected Output:

1006657 341175  36435436130     36472526498     694016
1013287 342280  36694005846     36533489363     697098
1006613 343867  36763692173     36755893252     699976
1014883 342436  36575951812     36572503611     695683
1003022 343238  36705838418     36637429353     698618
1013287 342280  36694005846     36533489363     697098
1006657 341175  36435436130     36472526498     694016
1006613 343867  36763692173     36755893252     699976
1014883 342436  36575951812     36572503611     695683
1003022 343238  36705838418     36637429353     698618

Upvotes: 0

Views: 73

Answers (3)

wolfrevo

Reputation: 7303

I assume that your HTML data has the following form:

<table>
    <tr>
        <td>plantype</td>
        <td>count(distinct(SubscriberId))</td>
        ...
    </tr>
    <tr>
        <td>1006657</td>
        <td>341175</td>
        ...
    </tr>
</table>

You could use a row_count variable to check whether you are in the first tr element of a table. Reset row_count to 0 in handle_starttag when a table opens, then check it and increment it in handle_endtag each time a tr closes:

class MyHTMLParser(HTMLParser):
    row_count = 0
    def handle_starttag(self, tag, attrs):
        if (tag == 'table'):
            self.row_count = 0

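    def handle_endtag(self, tag):
        # Count rows only when a tr closes, not on every end tag
        if tag == 'tr':
            if self.row_count > 0:
                textFile.write('\n')
            self.row_count += 1

    def handle_data(self, data):
        # row_count is 0 while inside the header row, so its cells are skipped
        if self.row_count > 0:
            textFile.write(data+"\t")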
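
For reference, here is a self-contained sketch of the same row-counting idea. It assumes Python 3 (where the module is html.parser rather than HTMLParser) and writes to an in-memory buffer instead of the global textFile. The class name HeaderSkippingParser, the out parameter and the sample HTML are made up for illustration and are not from the question.

```python
import io
from html.parser import HTMLParser


class HeaderSkippingParser(HTMLParser):
    """Write table cells as tab-separated text, skipping each table's first row."""

    def __init__(self, out):
        super().__init__()
        self.out = out          # any object with a write() method
        self.row_count = 0      # rows completed so far in the current table

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.row_count = 0  # every table starts with its own header row

    def handle_endtag(self, tag):
        if tag == 'tr':
            if self.row_count > 0:
                self.out.write('\n')
            self.row_count += 1

    def handle_data(self, data):
        # Skip the header row and whitespace-only text between tags.
        if self.row_count > 0 and data.strip():
            self.out.write(data.strip() + '\t')


# Hypothetical sample input: two tables, each with its own header row.
SAMPLE_HTML = """
<table>
  <tr><td>plantype</td><td>count</td></tr>
  <tr><td>1006657</td><td>341175</td></tr>
</table>
<table>
  <tr><td>plantype</td><td>count</td></tr>
  <tr><td>1013287</td><td>342280</td></tr>
</table>
"""

buffer = io.StringIO()
parser = HeaderSkippingParser(buffer)
parser.feed(SAMPLE_HTML)
parser.close()
result = buffer.getvalue()
print(result)
```

Resetting the counter on every table start tag is what removes the second header line in the question's output, assuming the repeated header comes from a second table in the file.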

Upvotes: 0

shaktimaan

Reputation: 12092

Since you are trying to get rid of anything that does not consist of numbers, you could try modifying your handle_data(self, data) method as follows:

def handle_data(self, data):
    if data.isdigit():
        textFile.write(data+"\t")
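
One thing to watch for with this filter: the question's handle_endtag still writes a newline when the header row closes, so the header may leave an empty line behind. A minimal sketch that also suppresses that newline is below. It assumes Python 3 (html.parser), and the row_has_data flag, the class name DigitsOnlyParser and the sample HTML are additions for illustration, not part of the snippet above. Note that str.isdigit() rejects negative numbers and decimals, so this only suits tables of plain non-negative integers.

```python
import io
from html.parser import HTMLParser


class DigitsOnlyParser(HTMLParser):
    """Keep only cells made of digits; emit a newline only for rows that had some."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.row_has_data = False  # assumption: extra flag to avoid blank lines

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row_has_data = False

    def handle_endtag(self, tag):
        if tag == 'tr' and self.row_has_data:
            self.out.write('\n')

    def handle_data(self, data):
        cell = data.strip()        # tolerate whitespace around the number
        if cell.isdigit():
            self.out.write(cell + '\t')
            self.row_has_data = True


# Hypothetical sample input with one header row and one data row.
SAMPLE_HTML = (
    "<table>"
    "<tr><td>plantype</td><td>sum(UpBytesNONE)</td></tr>"
    "<tr><td>1006657</td><td> 341175 </td></tr>"
    "</table>"
)

buffer = io.StringIO()
parser = DigitsOnlyParser(buffer)
parser.feed(SAMPLE_HTML)
parser.close()
result = buffer.getvalue()
print(result)
```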

Upvotes: 1

Ericson Willians

Reputation: 7845

Try this:

fh = open('/data/aman/aggregate.html','r')
l = fh.readlines()
for line in l:
    if 'plantype' not in line:
        parser.feed(line)

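You're reading the file line by line. With the check "if 'plantype' not in line", parser.feed is called only for the lines that do not contain the header text, which are the ones you want. Note that this works only if the whole header row sits on a single line of the HTML file.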
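
A runnable sketch of this line filter follows, assuming Python 3 (html.parser) and assuming each table row occupies exactly one line of the input. TableTextParser is a port of the question's parser that writes to a buffer instead of the global textFile. The helper feed_without_header, its marker parameter and the sample lines are hypothetical names introduced here.

```python
import io
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Python 3 port of the question's parser, writing to `out` instead of a global."""

    def __init__(self, out):
        super().__init__()
        self.out = out

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.out.write('\n')

    def handle_data(self, data):
        if data.strip():           # ignore whitespace between tags
            self.out.write(data.strip() + '\t')


def feed_without_header(parser, lines, marker='plantype'):
    """Feed every line to the parser except those containing `marker`."""
    for line in lines:
        if marker not in line:
            parser.feed(line)


# Hypothetical sample input: one complete table row per line.
SAMPLE_LINES = [
    "<table>\n",
    "<tr><td>plantype</td><td>sum(UpBytesNONE)</td></tr>\n",
    "<tr><td>1006657</td><td>36472526498</td></tr>\n",
    "</table>\n",
]

buffer = io.StringIO()
parser = TableTextParser(buffer)
feed_without_header(parser, SAMPLE_LINES)
parser.close()
result = buffer.getvalue()
print(result)
```

If the header cells are spread over several lines, only the line containing the marker is dropped and the remaining header cells still reach the parser, so a parser-level approach is safer in that case.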

Upvotes: 0
