tipu

Reputation: 9604

Converting html entities into their values in python

I use this regex on some input:

[^a-zA-Z0-9@#]

However, this ends up removing a lot of HTML special characters from the input, such as

#227;, #1606;, #1588; (I had to remove the & prefix so that they wouldn't show up as the actual characters.)

Is there a way to convert them to their values so that they satisfy the regular expression? I also have no idea why the text decided to be so big.

Upvotes: 2

Views: 2535

Answers (3)

Alex Martelli

Reputation: 881635

Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:

import re
xed_re = re.compile(r'&#(\d+);')
def usub(m): return unichr(int(m.group(1)))

s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)

If your terminal emulator can display arbitrary unicode glyphs, print u will then show

ã, ن, ش

In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
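For readers on Python 3, where unichr no longer exists and chr covers the full Unicode range, the same two steps might look roughly like this. It is only a sketch: the names decode_numeric_entities, NUMERIC_ENTITY_RE and UNWANTED_RE and the sample string are made up for illustration, and the second pattern is the character class from the question.

```python
import re

# Decimal numeric character references such as &#227; (illustrative name)
NUMERIC_ENTITY_RE = re.compile(r'&#(\d+);')
# The character class from the question: everything except ASCII letters,
# digits, @ and #
UNWANTED_RE = re.compile(r'[^a-zA-Z0-9@#]')

def decode_numeric_entities(text):
    """Replace each &#NNN; reference with the character it encodes."""
    # Note: chr() raises ValueError for numbers outside the Unicode range;
    # this sketch does not handle that case.
    return NUMERIC_ENTITY_RE.sub(lambda m: chr(int(m.group(1))), text)

raw = 'S&#227;o @home #1'                       # made-up sample input
decoded = decode_numeric_entities(raw)          # entity becomes one character
cleaned = UNWANTED_RE.sub('', decoded)          # original regex applied afterwards
filtered_without_decoding = UNWANTED_RE.sub('', raw)

print(cleaned)
print(filtered_without_decoding)
```

Without the decoding step, the entity's digits and its # survive the filter as stray text; with it, the decoded character is treated like any other non-ASCII character and removed.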

If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
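If Python 3 is an option, the standard library's html.unescape (available since 3.4) handles numeric and named references in one call, so no hand-written regex is needed for the decoding step. A minimal sketch, with a made-up sample string:

```python
import html

# Made-up sample mixing a named entity, two numeric ones and an unknown name
text = 'x &laquo; y &#1606; z &#123; &wat;'

# html.unescape decodes known named and numeric references; a name it does
# not recognise is left in the text unchanged.
decoded = html.unescape(text)
print(decoded)
```

The question's regex can then be applied to the decoded string as before.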

Upvotes: 4

user319799

Reputation:

You can adapt the following script:

import htmlentitydefs
import re

def substitute_entity (match):
    name = match.group (1)
    if name in htmlentitydefs.name2codepoint:
        return unichr (htmlentitydefs.name2codepoint[name])
    elif name.startswith ('#'):
        try:
            return unichr (int (name[1:]))
        except (ValueError, OverflowError):
            pass

    return '?'

print re.sub ('&(#?\\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')

This produces the following output here:

x « y ? z {

EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
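The script above targets Python 2. A rough Python 3 port, assuming the same behaviour is wanted (unknown entities become '?'), could look like the following; htmlentitydefs lives at html.entities in Python 3 and unichr becomes chr. The name ENTITY_RE is only for illustration.

```python
import re
from html.entities import name2codepoint

# Same pattern as the script above: a named or numeric entity between & and ;
ENTITY_RE = re.compile(r'&(#?\w+);')

def substitute_entity(match):
    name = match.group(1)
    if name in name2codepoint:
        # Named entity such as &laquo;
        return chr(name2codepoint[name])
    if name.startswith('#'):
        try:
            # Decimal numeric entity such as &#123;
            return chr(int(name[1:]))
        except (ValueError, OverflowError):
            pass
    # Anything unrecognised is replaced by a placeholder
    return '?'

result = ENTITY_RE.sub(substitute_entity, 'x &laquo; y &wat; z &#123;')
print(result)
```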

Upvotes: 1

Trey Hunner

Reputation: 11814

Without knowing what the expression is being used for, I can't tell exactly what you need.

This will match special characters or strings of characters excluding letters, digits, @, and #:

[^a-zA-Z0-9@#]*|#[0-9A-Za-z]+;
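If the goal is to strip unwanted characters while leaving entities intact, one way to use an alternation like this is sketched below. This is an assumed variant, not the exact expression above: the entity alternative is written with its & prefix and placed first so that it wins over the character class, and the character class matches one character at a time to avoid empty matches. PATTERN and keep_entities are hypothetical names.

```python
import re

# Assumed variant of the pattern: try the entity first, then any single
# character that is not an ASCII letter, digit, @ or #
PATTERN = re.compile(r'&#[0-9A-Za-z]+;|[^a-zA-Z0-9@#]')

def keep_entities(match):
    token = match.group(0)
    # Keep whole entities untouched, drop every other matched character
    return token if token.startswith('&#') else ''

cleaned = PATTERN.sub(keep_entities, 'S&#227;o @home!')  # made-up sample
print(cleaned)
```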

Upvotes: 0
