Reputation: 731
I've read a lot of similar questions but didn't find a solution for all of the issues I hit during my data cleanup.
I have a script that crawls a set of websites and gets a certain block of text from each page's body.
The issues I run into are leftover HTML markup in the text, different symbols used for quotes (not ' but ` or even worse cases), HTML entities such as &amp;, and so on.
Right now I pass the text through my own cleanup functions, but they are not perfect and still miss some cases.
I was wondering if there is a package or a common way to clean up a string from all of these cases, e.g. converting characters like ` to '?
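For the non-markup part of this (entities and quote variants), a minimal sketch in Python 3 could combine the standard library's html.unescape with str.translate; the set of quote characters mapped here is an assumption and would need extending for your data:

```python
import html

# Map common typographic quote variants to plain ASCII quotes.
# This character set is an assumption; extend it for your pages.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "`": "'", "\u00b4": "'",
    "\u201c": '"', "\u201d": '"',
})

def normalize_text(s):
    s = html.unescape(s)           # &amp; -> &, &#39; -> ', etc.
    return s.translate(QUOTE_MAP)  # fold quote variants to ' and "

print(normalize_text("`quoted&#39; \u201ctext\u201d &amp; more"))
# -> 'quoted' "text" & more
```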
Upvotes: 2
Views: 119
Reputation: 621
You can use the HTMLParser module from the standard library.
On Python 2:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
On Python 3:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Call the base initializer; convert_charrefs=True decodes
        # character references (&amp;, &#39;, ...) before handle_data.
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
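As a quick check of the Python 3 version, feeding it a fragment with markup and an entity strips the tags and (thanks to convert_charrefs=True) decodes &amp; as well; the sample input string is just an illustration:

```python
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes entity/char references for us
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

print(strip_tags("<p>Hello <b>world</b> &amp; friends</p>"))
# -> Hello world & friends
```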
Upvotes: 1