How can I convert HTML into text without markup in Python?

Question

I need to get plain text from an HTML document while honoring <br> elements as newlines. BeautifulSoup's .text does not convert <br> elements into newlines. HTML2Text is quite nice, but it converts to markdown. How else could I approach this?

That1Guy · Accepted Answer

I like to use the following method. To honor new lines, you can do a manual .replace('<br>', '\n') on the string before passing it to strip_tags(html).

From this question:

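from html.parser import HTMLParser  # Python 3; in Python 2: from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # sets up parser state (calls reset() internally)
        self.fed = []       # text chunks collected so far
    def handle_data(self, d):
        # Called for each run of text found between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()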
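
A plain string replace only catches one exact spelling of the tag, so it may be easier to let the parser handle it. The following is a minimal sketch rather than part of the original answer, assuming the elements to honor are <br> tags and that Python 3 is in use. The names BRAwareStripper and html_to_text are hypothetical. It follows the same approach as MLStripper, but appends a newline whenever the parser reports a <br> start tag, which also covers variants such as <BR> and <br/>.

```python
from html.parser import HTMLParser


class BRAwareStripper(HTMLParser):
    """Hypothetical variant of MLStripper that turns <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.fed = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing <br/> is routed here too
        if tag == "br":
            self.fed.append("\n")

    def handle_data(self, d):
        # Text between tags is kept unchanged
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)


def html_to_text(html):
    s = BRAwareStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()


print(html_to_text("<p>first line<br>second <b>line</b><BR/>third</p>"))
```

The same hook could be extended to other elements (for example, emitting a newline for p or div), depending on how closely the output should mimic the page layout.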
