Reputation: 1530
I need to convert any html entity into its ASCII equivalent using Python. My use case is that I am cleaning up some HTML used to build emails to create plaintext emails from the HTML.
Right now, I only really know how to create unicode from these entities when I need ASCII (I think) so that the plaintext email reads correctly with things like accented characters. I think a basic example is the html entity "& aacute;" or á being encoded into ASCII.
Furthermore, I'm not even 100% sure that ASCII is what I need for a plaintext email. As you can tell, I'm completely lost on this encoding stuff.
Upvotes: 4
Views: 9082
Reputation: 165
Here is a complete implementation that also handles unicode html entities. You might find it useful.
It returns a unicode string that is not ascii, but if you want plain ascii, you can modify the replace operations so that it replaces the entities to empty string.
def convert_html_entities(s):
matches = re.findall("&#\d+;", s)
if len(matches) > 0:
hits = set(matches)
for hit in hits:
name = hit[2:-1]
try:
entnum = int(name)
s = s.replace(hit, unichr(entnum))
except ValueError:
pass
matches = re.findall("&#[xX][0-9a-fA-F]+;", s)
if len(matches) > 0:
hits = set(matches)
for hit in hits:
hex = hit[3:-1]
try:
entnum = int(hex, 16)
s = s.replace(hit, unichr(entnum))
except ValueError:
pass
matches = re.findall("&\w+;", s)
hits = set(matches)
amp = "&"
if amp in hits:
hits.remove(amp)
for hit in hits:
name = hit[1:-1]
if htmlentitydefs.name2codepoint.has_key(name):
s = s.replace(hit, unichr(htmlentitydefs.name2codepoint[name]))
s = s.replace(amp, "&")
return s
Edit: added matching for hexcodes. I've been using this for a while now, and ran into my first situation with ' which is a single quote/apostrophe.
Upvotes: 8
Reputation: 1137
We put up a little module with agazso's function:
http://github.com/ARTFL/util/blob/master/ents.py
We find agazso's function to faster than the alternatives for ent conversion. Thanks for posting it.
Upvotes: 0
Reputation: 123468
You can use the htmlentitydefs package:
import htmlentitydefs
print htmlentitydefs.entitydefs['aacute']
Basically, entitydefs
is just a dictionary, and you can see this by printing it at the python prompt:
from pprint import pprint
pprint htmlentitydefs.entitydefs
Upvotes: 1
Reputation: 881477
ASCII is the American Standard Code for Information Interchange and does not include any accented letters. Your best bet is to get Unicode (as you say you can) and encode it as UTF-8 (maybe ISO-8859-1 or some weird codepage if you're dealing with seriously badly coded user-agents/clients, sigh) -- the content type header of that part together with text/plain can express what encoding you've chosen to use (I do recommend trying UTF-8 unless you have positively demonstrated it cannot work -- it's almost universally supported these days and MUCH more flexible than any ISO-8859 or "codepage" hack!).
Upvotes: 2