Ian

Reputation: 2021

Need RE to detect UTF-8

I have the following code

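import codecs

# inPath and outPath are set elsewhere
inf = codecs.open(inPath, encoding='utf-8')
outf = codecs.open(outPath, encoding='utf-8', mode='w')
old = u'’;'
new = u'’&#59;'
for line in inf:
    # Special case: a semicolon directly after a right single quote
    line = line.replace(old, new)
    # Replace every non-ASCII character with a numeric character reference
    asc = line.encode('ascii', 'xmlcharrefreplace').decode('ascii')
    outf.write(asc)
    # print asc
inf.close()
outf.close()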

This correctly converts smart quotes, accented characters and so on into numeric HTML character references. It will convert

<p>Dreams like: “Someday I’ll travel to…; someday I’ll write a book;

into

<p>Dreams like: &#8220;Someday I&#8217;ll travel to&#8230;; someday I&#8217;ll write a book; 

This is all correct.

However, code further downstream sees the &#8230;; in the middle, drops the double semicolon, and then complains that it has not got a valid entity. I can't change that code.

As you can see from my code, I have caught one case where an entity is followed by a semicolon. I don't want to replace all the semicolons in the source.

How can I detect a semicolon that follows a non-ASCII character (code point > 127), so that I can replace it with &#59;? Thanks.

Upvotes: 1

Views: 102

Answers (1)

Ian

Reputation: 2021

Face Palm!

If I convert to HTML entities first, and then replace ;; with ;&#59; that solves my problem.
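Roughly, as a minimal sketch (the helper name to_entities is made up for illustration, and &#59; is the numeric reference for a semicolon):

```python
def to_entities(text):
    """Hypothetical helper: return ASCII-only text using numeric character references."""
    # Step 1: replace every non-ASCII character with its &#NNNN; reference
    encoded = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')
    # Step 2: after step 1, ';;' is an entity followed by a literal semicolon,
    # so escape the second semicolon as &#59;
    return encoded.replace(';;', ';&#59;')


sample = u'travel to\u2026; someday I\u2019ll write a book;'
print(to_entities(sample))
```

Be aware that this also rewrites any literal ;; already in the source, which should be harmless, and that a run of three or more semicolons would need extra handling.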

Note to self - consider WHERE you do things, as well as what to do!

Upvotes: 1
