user29772

Reputation: 1467

Python HTML removal

How can I remove all HTML from a string in Python? For example, how can I turn:

blah blah <a href="blah">link</a>

into

blah blah link

Thanks!

Upvotes: 7

Views: 10973

Answers (9)

Igor Medeiros

Reputation: 4126

There's a simple way to do this:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s
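One caveat with the function above: it toggles `quote` on either quote character, so an attribute such as `title="it's"` could end the quoted state early. Below is a minimal sketch of a variant that remembers which quote character opened the value. The name `strip_tags_state_machine` and the variable names are hypothetical, and this is still a character scanner, not a real HTML parser.

```python
def strip_tags_state_machine(s):
    """Drop everything between '<' and '>', honoring quoted attribute values."""
    in_tag = False
    quote_char = None  # the quote character that opened the current attribute value
    out = []

    for c in s:
        if in_tag:
            if quote_char is not None:
                # Inside a quoted value: only the matching quote ends it.
                if c == quote_char:
                    quote_char = None
            elif c in ('"', "'"):
                quote_char = c
            elif c == '>':
                in_tag = False
        elif c == '<':
            in_tag = True
        else:
            out.append(c)

    return ''.join(out)


print(strip_tags_state_machine('blah blah <a href="blah">link</a>'))
print(strip_tags_state_machine('<a title="it\'s > here">x</a>'))
```

Collecting characters in a list and joining once also avoids repeated string concatenation on long inputs.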

PS: If you're interested in the class (about smart debugging with Python), here's the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)

Upvotes: 1

David Kent Snyder

Reputation: 11

I just wrote this because I needed it. It uses html2text and takes a file path, although I would prefer a URL. The output of html2text is stored in TextFromHtml2Text.text: print it, store it, feed it to your pet canary.

import html2text


class TextFromHtml2Text:

    def __init__(self, url=''):
        # Despite the name, `url` is a local file path; it is opened with open().
        if url == '':
            raise TypeError("Needs a file path")
        self.text = ""
        self.url = url
        self.html = ""
        self.gethtmlfile()
        self.maytheswartzbewithyou()

    def gethtmlfile(self):
        # Read the whole file and close it when done.
        with open(self.url) as file:
            self.html = file.read()

    def maytheswartzbewithyou(self):
        # Convert the HTML to plain text with html2text.
        self.text = html2text.html2text(self.html)

Upvotes: 1

jfs

Reputation: 414745

Regexes, BeautifulSoup, and html2text don't work if an attribute has '>' in it. See the question "Is '>' (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?"

An HTML/XML parser-based solution might help in such cases; e.g., stripogram, suggested by @MrTopf, does work.
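To illustrate the '>' problem, here is a small sketch that compares a naive regex with a parser-based approach. It assumes Python 3 and uses the standard library's `html.parser` in place of stripogram; the names `TextExtractor` and `strip_tags_parser` are made up for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes; tags and attributes are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_parser(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)


sample = 'blah <a title=">" href="blah">link</a>'

# The regex stops at the first '>' and leaves attribute debris behind.
print(re.sub('<[^>]*>', '', sample))

# The parser treats the quoted '>' as part of the attribute value.
print(strip_tags_parser(sample))
```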

Here's an ElementTree-based solution:

# from xml.etree import ElementTree as etree  # stdlib alternative
from lxml import etree

str_ = 'blah blah <a href="blah">link</a> END'
root = etree.fromstring('<html>%s</html>' % str_)
print(''.join(root.itertext()))  # lxml or ElementTree 1.3+

Output:

blah blah link END
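Note that `fromstring` is an XML parser, so this approach presumably only suits well-formed input. A minimal sketch using the standard library's `xml.etree.ElementTree` on Python 3 follows; the helper name `strip_tags_xml` and the choice to return `None` on a parse failure are assumptions for illustration.

```python
from xml.etree import ElementTree


def strip_tags_xml(fragment):
    """Return the text of a well-formed fragment, or None if it does not parse."""
    try:
        # Wrap the fragment so that it has a single root element.
        root = ElementTree.fromstring('<html>%s</html>' % fragment)
    except ElementTree.ParseError:
        return None
    return ''.join(root.itertext())


print(strip_tags_xml('blah blah <a href="blah">link</a> END'))
print(strip_tags_xml('<a title=">">link</a>'))   # '>' inside an attribute is fine
print(strip_tags_xml('line one<br>line two'))    # unclosed <br> is not valid XML
```

For tag soup such as an unclosed `<br>`, an HTML-aware parser is likely the safer choice.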

Upvotes: 5

RexE

Reputation: 17733

html2text will do something like this.

Upvotes: 2

MrTopf

Reputation: 4853

There is also a small library called stripogram which can be used to strip away some or all HTML tags.

You can use it like this:

from stripogram import html2text, html2safehtml
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html,valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces 
# and a page that's 80 characters wide.
text = html2text(original_html,ignore_tags=("img",),indent_width=4,page_width=80)

So if you want to simply strip out all HTML, you pass valid_tags=() to the first function.

You can find the documentation here.

Upvotes: 10

Kenan Banks

Reputation: 212138

When your regular expression solution hits a wall, try this super easy (and reliable) BeautifulSoup program.

# BeautifulSoup 3 import; with BeautifulSoup 4, use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = "<a> Keep me </a>"
soup = BeautifulSoup(html)

text_parts = soup.findAll(text=True)
text = ''.join(text_parts)

Upvotes: 18

riza

Reputation: 17144

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> q = re.compile(r'<.*?>', re.IGNORECASE)
>>> re.sub(q, '', s)
'blah blah link'

Upvotes: 0

George V. Reilly

Reputation: 16333

Try Beautiful Soup. Throw away everything except the text.

Upvotes: 3

Luke Woodward

Reputation: 65044

You can use a regular expression to remove all the tags:

>>> import re
>>> s = 'blah blah <a href="blah">link</a>'
>>> re.sub('<[^>]*>', '', s)
'blah blah link'
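The regex removes tags but leaves character references such as `&amp;` untouched. If decoded text is wanted, one option on Python 3 is to pass the result through `html.unescape`. This is a sketch; `TAG_RE` and `strip_tags_and_unescape` are hypothetical names, and the regex keeps the same limitations as above.

```python
import html
import re

# Same pattern as above, compiled once for reuse.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags_and_unescape(s):
    """Remove tags with the regex, then decode entities like &amp; and &lt;."""
    return html.unescape(TAG_RE.sub('', s))


print(strip_tags_and_unescape('blah blah <a href="blah">link</a>'))
print(strip_tags_and_unescape('Tom &amp; <b>Jerry</b>'))
```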

Upvotes: 7
