user168983

Reputation: 844

How can I speed up string concatenation?

I have about 3,000,000 strings containing HTML tags. I am trying to remove the tags and keep only the content. My code is below, but it is taking a lot of time. Is there any way I can do parallel processing, or otherwise speed up my implementation?

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

st = ""  # accumulated output
for each in lis:
    if each != None:
        each = strip_tags(each)
        st += " " + each.decode('utf-8')

Upvotes: 0

Views: 159

Answers (4)

Boggio

Reputation: 1148

I haven't tested the code below, but it strictly answers the question of how to process the input in parallel. I think your code could benefit from other optimisations as well; check the other answers for those.

from HTMLParser import HTMLParser
from multiprocessing import Pool

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

if __name__ == '__main__':

    lis = []  # you should get lis contents here

    pool = Pool()

    # keep map()'s return value; it is the list of stripped strings
    results = pool.map(strip_tags, (each for each in lis if each is not None))
    pool.close()
    pool.join()

    st = " ".join(r.decode('utf-8') for r in results)

Upvotes: 1

Sean Azlin

Reputation: 916

To further answer your question about parallel processing: yes, you could use it here. One idea is to use map and reduce, with IPython multiprocessing, Hadoop, AWS EMR, etc., to strip all those strings and concatenate them into a file or some other output stream.
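As a minimal sketch of the multiprocessing flavour of that idea, reusing strip_tags and lis from the question and assuming a hypothetical output.txt as the output stream:

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    with open('output.txt', 'w') as out:
        # imap() streams results back in order, so the concatenated
        # result never has to be held in memory all at once
        for stripped in pool.imap(strip_tags,
                                  (s for s in lis if s is not None),
                                  chunksize=1000):
            out.write(" " + stripped)
    pool.close()
    pool.join()

The chunksize just batches work to cut down inter-process overhead; tune it to your data.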

Upvotes: 1

Marichyasana

Reputation: 3154

Suppose you have a multi-core computer with 8 cores. Use the Linux bash command:

 split -l 375000 filename

This will give you 8 files of 375000 lines each, named "xaa", "xab", "xac", ... and "xah". Next, run your program 8 times, once on each of the smaller files (put & at the end of each command so they run in the background). The OS should run each of them on a different core in parallel. Then concatenate the 8 output files into one result file.

Upvotes: 1

Dunes

Reputation: 40693

Doing string concatenation in a for loop is a problem because strings are immutable: a new string object must be created for each concatenation (twice per iteration of the loop in your case), so the total work grows quadratically with the length of the result.

You can use join and a generator to improve efficiency.

for each in lis:
    if each != None:
        each = strip_tags(each)
        st += " " + each.decode('utf-8')

becomes:

st = " ".join(strip_tags(each).decode('utf-8') for each in lis if each is not None)

Upvotes: 2
