Reputation: 1607
I have made a simple HTML parser which is basically a direct copy from the docs. I am having trouble unescaping special characters without also splitting up data into multiple chunks.
Here is my code with a simple example:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = []

    def handle_starttag(self, tag, attrs):
        #print (tag,attrs)
        pass

    def handle_endtag(self, tag):
        #print (tag)
        pass

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, ref):
        self.handle_entityref("#" + ref)

    def handle_entityref(self, ref):
        self.handle_data(self.unescape("&%s;" % ref))

n = "<strong>I &lt;3s U &amp; you luvz me</strong>"
parser = MyHTMLParser()
parser.feed(n)
parser.close()
data = parser.data
print(data)
The issue is that this returns five separate pieces of data:
['I ', u'<', '3s U ', u'&', ' you luvz me']
Where what I want is the single string:
['I <3s U & you luvz me']
Thanks JP
Upvotes: 4
Views: 753
Reputation: 1702
You can refer to this answer and adapt the html_to_text function to what you need.
from HTMLParser import HTMLParser

n = "<strong>I &lt;3s U &amp; you luvz me</strong>"

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        # Keep the reference escaped for now; it is decoded once at the end
        self.fed.append('&%s;' % name)

    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return HTMLParser().unescape(s.get_data())

print html_to_text(n)
Output:
I <3s U & you luvz me
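The code above is written for Python 2. On Python 3 the same idea could be sketched roughly as follows, assuming the standard html.parser and html.unescape modules. The handle_charref method is an addition that the original snippet does not have, and convert_charrefs=False is passed so that the reference handlers are actually called.

```python
from html import unescape
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collect text, keeping references escaped until the very end."""

    def __init__(self):
        # convert_charrefs=False keeps handle_entityref/handle_charref active
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)

    def handle_charref(self, name):
        # Addition for illustration: keep numeric references such as &#38; too
        self.fed.append('&#%s;' % name)

    def get_data(self):
        return ''.join(self.fed)


def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    s.close()
    # Decode every reference in one pass over the joined string
    return unescape(s.get_data())


print(html_to_text("<strong>I &lt;3s U &amp; you luvz me</strong>"))
```

On Python 3.5 and later, leaving convert_charrefs at its default of True makes the parser decode references itself before calling handle_data, so the two reference handlers and the final unescape call are then unnecessary.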
Upvotes: 1
Reputation: 4129
Remember that the purpose of HTMLParser is to let you build a document tree from an input. If you don't care at all about the document's structure, then the str.join solution @falsetru gives will be fine. You can be certain that all element tags and comments will be filtered out.
However, if you do need the structure for more complex scenarios then you have to build a document tree. The handle_starttag and handle_endtag methods are here for this.
First we need a basic tree that can hold some information.
class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''
Now you need to make the HTMLParser create a new node on every handle_starttag and move up the tree on every handle_endtag. We also pass the parsed data to the current node instead of holding it in the parser.
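from html import unescape
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Element(None, '__DOCROOT__') # Special root node for us
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        newel = Element(self.current, tag, attrs)
        self.current.children.append(newel)
        self.current = newel

    def handle_endtag(self, tag):
        self.current = self.current.parent

    def handle_data(self, data):
        self.current.data += data

    # On Python 3.5+ references are decoded before handle_data is called
    # (convert_charrefs=True), so the two handlers below only run when
    # the parser is created with convert_charrefs=False.
    def handle_charref(self, ref): # No changes here
        self.handle_entityref('#' + ref)

    def handle_entityref(self, ref): # Same as before, but using html.unescape
        # HTMLParser.unescape was removed in Python 3.9
        self.handle_data(unescape("&%s;" % ref))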
Now you can access the tree through the parser's root attribute to get the data from any element as you like. For example:
n = '<strong>I &lt;3s U &amp; you luvz me</strong>'
p = MyHTMLParser()
p.feed(n)
p.close()

def print_tree(node, indent=0):
    print(' ' * indent + node.tag)
    if node.data:  # skip elements that have no text of their own
        print(' ' * indent + ' ' + node.data)
    for c in node.children:
        print_tree(c, indent + 1)

print_tree(p.root)
This will give you:
__DOCROOT__
 strong
  I <3s U & you luvz me
If instead you parsed n = '<html><head><title>Test</title></head><body><h1>I &lt;3s U &amp; you luvz me</h1></body></html>', you would get:
__DOCROOT__
 html
  head
   title
    Test
  body
   h1
    I <3s U & you luvz me
Next up is to make the tree building robust and handle cases like mismatched or implicit end tags. You will also want to add some nice find('tag')-like methods on Element for traversing the tree. Do it well enough and you'll have made the next BeautifulSoup.
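As a rough idea of what such traversal helpers might look like, here is a minimal sketch. The iter and find methods are hypothetical additions to the Element class above, not part of any library, and the small tree is built by hand only so that the example runs on its own.

```python
class Element:
    def __init__(self, parent, tag, attrs=None):
        self.parent = parent
        self.tag = tag
        self.children = []
        self.attrs = attrs or []
        self.data = ''

    def iter(self, tag=None):
        """Yield this element and its descendants depth-first.

        If tag is given, only elements with that tag are yielded.
        """
        if tag is None or self.tag == tag:
            yield self
        for child in self.children:
            yield from child.iter(tag)

    def find(self, tag):
        """Return the first element with the given tag, or None."""
        return next(self.iter(tag), None)


# Hand-built tree standing in for what the parser would produce
root = Element(None, '__DOCROOT__')
body = Element(root, 'body')
root.children.append(body)
h1 = Element(body, 'h1')
h1.data = 'I <3s U & you luvz me'
body.children.append(h1)

print(root.find('h1').data)  # I <3s U & you luvz me
print(root.find('table'))    # None
```

Returning None for a missing tag keeps the caller's check simple. Raising an exception instead would be an equally reasonable design choice.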
Upvotes: 1
Reputation: 369154
Join the list of strings using str.join:
>>> ''.join(['I ', u'<', '3s U ', u'&', ' you luvz me'])
u'I <3s U & you luvz me'
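The join can also live inside the parser. Below is a minimal sketch, assuming Python 3.5 or later, where html.parser decodes references by default (convert_charrefs=True). TextCollector and its text() method are hypothetical names chosen for illustration. Even with references decoded, text interrupted by nested tags still arrives in several chunks, so the join is still needed.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):  # hypothetical helper class
    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True on Python 3.5+
        self.chunks = []

    def handle_data(self, data):
        # References such as &lt; and &amp; are already decoded here
        self.chunks.append(data)

    def text(self):
        return ''.join(self.chunks)


parser = TextCollector()
parser.feed("<p>I &lt;3 <b>you</b> &amp; me</p>")
parser.close()
print(parser.chunks)  # ['I <3 ', 'you', ' & me']
print(parser.text())  # I <3 you & me
```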
Alternatively, you can use an external library such as lxml:
>>> import lxml.html
>>> n = "<strong>I &lt;3s U &amp; you luvz me</strong>"
>>> root = lxml.html.fromstring(n)
>>> root.text_content()
'I <3s U & you luvz me'
Upvotes: 3