Need RE to detect UTF-8

Question

I have the following code

import codecs

inf = codecs.open(inPath , encoding='utf-8')
outf = codecs.open(outPath, encoding='utf-8', mode='w')
old = u'’;'
new = u'’&#59;'
for line in inf:
    line = line.replace(old,new)
    asc = line.encode('ascii', 'xmlcharrefreplace')    
    outf.write(asc)
    # print asc
inf.close()
outf.close()

This (correctly) converts smart quotes, accented characters, etc. into their numeric HTML entity form. It will convert

Dreams like: “Someday I’ll travel to…; someday I’ll write a book;

into

Dreams like: &#8220;Someday I&#8217;ll travel to&#8230;; someday I&#8217;ll write a book;

This is all correct.

However, code further downstream sees the &#8230;; in the middle, drops the double semicolon, and then complains that it has not got a valid entity. I can't change this code.

As you can see from my code, I have caught one case where an entity is followed by a semi-colon. I don't want to replace all the semi-colons in the source.

How can I detect a semicolon that follows a UTF-8 character with a code point > 127, so that I can replace it with &#59;? Thanks.

Ian · Accepted Answer

Face Palm!

If I convert to HTML entities first, and then replace ;; with ;E, that solves my problem.

Note to self - consider WHERE you do things, as well as what to do!
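
For anyone landing here later, a minimal sketch of both orderings, written for Python 3 rather than the Python 2 style of the question. The function names and the SEMICOLON_ENTITY constant are made up for illustration, and using &#59; as the replacement is an assumption about what the downstream code accepts. The first function encodes to entities and then escapes a semicolon that directly follows a numeric entity; the second is the regular expression the question asked for, applied before encoding.

```python
import re

# Assumption: the numeric entity for ';' is an acceptable stand-in downstream.
SEMICOLON_ENTITY = "&#59;"

# A numeric character reference immediately followed by a literal semicolon.
_ENTITY_THEN_SEMI = re.compile(r"(&#\d+;);")

# A semicolon immediately preceded by a character with code point > 127.
_NON_ASCII_THEN_SEMI = re.compile(r"(?<=[^\x00-\x7f]);")


def escape_after_encoding(text):
    """Convert to numeric entities first, then fix the doubled semicolon."""
    encoded = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return _ENTITY_THEN_SEMI.sub(
        lambda m: m.group(1) + SEMICOLON_ENTITY, encoded
    )


def escape_before_encoding(text):
    """Replace ';' after a non-ASCII character, then convert to entities."""
    fixed = _NON_ASCII_THEN_SEMI.sub(SEMICOLON_ENTITY, text)
    return fixed.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

The two agree on text like the example in the question. They differ if the input already contains a numeric entity followed by a semicolon: the first version escapes that semicolon too, while the second leaves it alone.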
