Sean W.

Reputation: 5132

How can I convert HTML into text without markup in Python?

I need to get plain text from an HTML document while honoring <br> elements as newlines. BeautifulSoup's .text does not turn <br> elements into newlines. HTML2Text is quite nice, but it converts to Markdown. How else could I approach this?

Upvotes: 4

Views: 449

Answers (2)

That1Guy

Reputation: 7233

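I like to use the following method. To honor line breaks, do a manual .replace('<br>','\r\n') on the string before passing it to strip_tags(html). Note that this only catches the exact string <br>, so variants such as <br/> or <BR> need their own replacement.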

From this question:

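from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # initializes parser state (calls reset() internally)
        self.fed = []
    def handle_data(self, d):
        # Called for each run of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()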
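
As an alternative to the manual .replace(), the parser itself can emit a newline whenever it sees a br tag. This is only a sketch for Python 3: the names BrAwareStripper and html_to_text are made up for illustration and are not part of the original answer. It relies on HTMLParser lowercasing tag names and routing self-closing tags such as <br/> through handle_starttag.

```python
from html.parser import HTMLParser


class BrAwareStripper(HTMLParser):
    """Collects text content and turns every <br> into a newline."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; <br/> also ends up here because the
        # default handle_startendtag delegates to handle_starttag.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Text between tags; character references are already decoded
        # because convert_charrefs defaults to True.
        self.parts.append(data)


def html_to_text(html):
    # Hypothetical helper name, used only for this example.
    parser = BrAwareStripper()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(html_to_text("<p>Hello<br>world<BR/>again</p>"))
```

This keeps the line-break handling in one place, so <br>, <br/> and <BR> are all treated the same way without separate string replacements.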

Upvotes: 4

mishik

Reputation: 10003

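You can convert <br> tags to newlines first, then strip out the remaining tags and replace them with spaces (if needed):

import re

# myString is assumed to hold the HTML source
# Match <br>, <br/>, <br /> and </br> in any letter case
myString = re.sub(r"</?br\s*/?>", "\n", myString, flags=re.IGNORECASE)
# Replace every remaining tag with a single space
myString = re.sub(r"<[^>]*>", " ", myString)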
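
Wrapped in a function, the same idea might look like the sketch below. The name regex_html_to_text is hypothetical, and the entity decoding and whitespace cleanup are additions that the answer itself does not include. A regex approach like this is likely to misbehave on comments, script/style contents, or attribute values that contain ">", so it is best suited to simple, well-formed fragments.

```python
import html
import re


def regex_html_to_text(markup):
    # Hypothetical helper name, used only for this example.
    # 1. Turn <br>, <br/>, <br /> and </br> (any case) into newlines.
    text = re.sub(r"</?br\s*/?>", "\n", markup, flags=re.IGNORECASE)
    # 2. Replace every remaining tag with a space.
    text = re.sub(r"<[^>]*>", " ", text)
    # 3. Decode entities such as &amp; only after the tags are gone,
    #    so that escaped markup like &lt;b&gt; stays literal text.
    text = html.unescape(text)
    # 4. Collapse runs of spaces and tabs, but keep the newlines.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


print(regex_html_to_text("<p>one<br />two &amp; three</p>"))
```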

Upvotes: 0
