Reputation: 844
I have about 3000000 strings with HTML tags. I am trying to remove the tags and keep the content. My code is below, but it is taking a lot of time. Is there any way I can do parallel processing? Any other way I can speed up my implementation?
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

st = ""
for each in lis:
    if each != None:
        each = strip_tags(each)
        st += " " + each.decode('utf-8')
Upvotes: 0
Views: 159
Reputation: 1148
I haven't tested the code below, but it answers strictly the question of how to process the input in parallel. I think your code could benefit from other optimisations as well, but check the other answers for those.
from HTMLParser import HTMLParser
from multiprocessing import Pool

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

if __name__ == '__main__':
    lis = []  # you should get lis contents here
    pool = Pool()
    # map returns the stripped strings in the same order as the input
    stripped = pool.map(strip_tags, [each for each in lis if each is not None])
    pool.close()
    pool.join()
Upvotes: 1
Reputation: 916
To further answer your question about parallel processing: yes, you could use it here. One idea is to use map and reduce with IPython parallel, multiprocessing, Hadoop, AWS EMR, etc. to strip all those strings and concatenate them into some file or other output stream.
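A minimal sketch of that map/reduce shape using multiprocessing (shown against Python 3, where HTMLParser lives in html.parser; the sample input list is made up for illustration):

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser
from multiprocessing import Pool


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


if __name__ == '__main__':
    lis = ['<p>first</p>', None, '<b>second</b>']  # made-up sample input
    with Pool() as pool:
        # "map" step: strip each string on a worker process
        stripped = pool.map(strip_tags, [x for x in lis if x is not None])
    # "reduce" step: one join instead of repeated += concatenation
    st = ' '.join(stripped)
```

The reduce step is deliberately a single `join` at the end, so the concatenation cost is paid once rather than per string.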
Upvotes: 1
Reputation: 3154
Suppose you have a multi-core computer with 8 cores. Use the Linux bash command:
split -l 375000 filename
This will give you 8 files of 375000 lines each, named "xaa", "xab", "xac", ... "xah". Next, run your program on each of the 8 smaller files (put & at the end of each command so it runs in the background). The OS should schedule them on different cores in parallel. Then concatenate the 8 output files into one result file.
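The steps above can be sketched end to end. The 100-line sample file and the sed command standing in for "your program" are made up so the sketch is self-contained:

```shell
# Build a small sample input (stand-in for the real 3000000-line file)
seq 1 100 | sed 's|.*|<p>&</p>|' > filename

# Split into 4 chunks of 25 lines each: chunk_aa, chunk_ab, chunk_ac, chunk_ad
split -l 25 filename chunk_

# Run one job per chunk in the background; sed stands in for your stripper
for f in chunk_a?; do
    sed 's|<[^>]*>||g' < "$f" > "$f.out" &
done
wait   # block until every background job has finished

# Concatenate the partial outputs, preserving the original line order
cat chunk_a?.out > result
```

Because split names the chunks in lexical order and the shell expands the glob in the same order, the final `cat` reassembles the lines in their original order.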
Upvotes: 1
Reputation: 40693
Doing string concatenation in a for loop is inefficient, as a new string object must be created for each concatenation (twice per iteration of the loop in your case).
You can use join and a generator to improve efficiency.
for each in lis:
    if each != None:
        each = strip_tags(each)
        st += " " + each.decode('utf-8')
becomes:
st = " ".join(strip_tags(each).decode('utf-8') for each in lis if each is not None)
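A tiny self-contained illustration of the same pattern; the regex-based strip_tags here is a made-up stand-in for the HTMLParser version, and the sample list is invented:

```python
import re


def strip_tags(html):
    # stand-in for the HTMLParser-based strip_tags: drop anything in <...>
    return re.sub(r'<[^>]+>', '', html)


lis = ['<p>first</p>', None, '<b>second</b>']
# one join over a generator: no intermediate string per loop iteration
st = ' '.join(strip_tags(each) for each in lis if each is not None)
```

Note one small behavioural difference from the original loop: the `+=` version prepends a leading space to the result, while `join` only puts separators between items.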
Upvotes: 2