GeekSince1982

Reputation: 731

How to clean a string of any non-Unicode / special characters, HTML markup, and JS, leaving pure text and punctuation, in Python?

I've read a lot of similar questions, but didn't find a solution for all the issues I get with my data cleanup.

I have a script which crawls a set of websites and gets a certain block of text from each page's body.

The issues I get include HTML markup left in the text, different symbols for quotes (for example ` instead of ', or even worse cases), entities like &amp, and so on.

Right now I parse the text through my own cleanup functions but they are not perfect and still miss some cases.

I was wondering if there is a package or a common way to clean a string of all of these cases and have characters like ` converted to ' and so on?

Upvotes: 2

Views: 119

Answers (2)

troy_achilies

Reputation: 621

You can use the HTMLParser module from the standard library to strip the tags.

On Python 2:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()   # initialise the parser state
        self.fed = []  # text chunks collected so far

    def handle_data(self, d):
        # called for every run of text between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

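On Python 3:

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.fed = []  # text chunks collected so far

    def handle_data(self, d):
        # called for every run of text between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()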
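
Stripping tags alone does not cover the quote symbols and leftover entities mentioned in the question. Below is a minimal Python 3 sketch of one way to chain the remaining steps using only the standard library. The names TextExtractor, clean_text and QUOTE_MAP are made up for illustration, the quote mapping is an assumption you would extend for your own data, and skipping script/style content is my addition rather than part of the class above.

```python
import html
import unicodedata
from html.parser import HTMLParser

# Hypothetical mapping of backticks and typographic quotes to ASCII quotes;
# extend it for whatever your crawled data actually contains.
QUOTE_MAP = str.maketrans({
    "`": "'",       # backtick
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_text(raw):
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()                              # flush buffered text
    text = "".join(parser.parts)
    text = html.unescape(text)                  # second pass for double-escaped entities
    text = text.translate(QUOTE_MAP)            # map quote variants to ' and "
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    return " ".join(text.split())               # collapse runs of whitespace


print(clean_text("<p>It`s &amp;amp; \u201cfine\u201d</p><script>var x = 1;</script>"))
```

Note that NFKC normalization also rewrites things like ligatures and full-width characters, and the second unescape pass will decode text that was meant to show a literal entity, so drop either step if that is not what you want.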

Upvotes: 1

MKesper

Reputation: 509

Did you have a look at Scrapy?

Upvotes: 0
