Reputation: 2021
I have the following code
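import codecs

inf = codecs.open(inPath, encoding='utf-8')
outf = codecs.open(outPath, encoding='utf-8', mode='w')
# Special case: a right single quote followed by a literal semicolon
old = u'’;'
new = u'’&#59;'
for line in inf:
    line = line.replace(old, new)
    # Replace every non-ASCII character with its numeric character reference
    asc = line.encode('ascii', 'xmlcharrefreplace')
    outf.write(asc)
    # print asc
inf.close()
outf.close()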
This (correctly) converts smart quotes, accented characters and so on into numeric HTML character references. It will convert
<p>Dreams like: “Someday I’ll travel to…; someday I’ll write a book;
into
<p>Dreams like: &#8220;Someday I&#8217;ll travel to&#8230;; someday I&#8217;ll write a book;
This is all correct.
However, code further downstream sees the &#8230;;
in the middle, drops the double semicolon and then complains that it has not got a valid entity. I can't change this code.
As you can see from my code, I have caught one case where an entity is followed by a semicolon. I don't want to replace all the semicolons in the source.
How can I detect a semicolon that follows a non-ASCII character (code point > 127), so that I can replace it with &#59;? Thanks.
Upvotes: 1
Views: 102
Reputation: 2021
Face Palm!
If I convert to HTML entities first, and then replace ;;
with ;E
that solves my problem.
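A minimal sketch of that ordering, written so it also runs on Python 3. The function name `to_ascii_entities` and the `escaped_semicolon` parameter are made up for illustration, and using `&#59;` as the replacement for the second semicolon is an assumption:

```python
def to_ascii_entities(text, escaped_semicolon="&#59;"):
    """Encode first, then fix up semicolons (hypothetical helper)."""
    # Step 1: every character above code point 127 becomes &#NNNN;
    encoded = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Step 2: a reference followed by a literal semicolon now shows up
    # as ";;", so only the second semicolon needs replacing.
    # Assumption: "&#59;" is an acceptable stand-in for that semicolon.
    return encoded.replace(";;", ";" + escaped_semicolon)


sample = u"to\u2026; someday I\u2019ll write a book;"
result = to_ascii_entities(sample)
```

One caveat with this sketch: a plain ";;" that was already in the source text gets rewritten too, since after encoding it looks the same as an entity followed by a semicolon.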
Note to self - consider WHERE you do things, as well as what to do!
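For comparison, the check the question asked about (a semicolon directly after a character with code point > 127) could probably also be done before encoding, with a regular expression. This is only a sketch: the names `NON_ASCII_THEN_SEMICOLON` and `escape_semicolon_after_non_ascii` are invented here, and `&#59;` as the replacement is an assumption.

```python
import re

# One non-ASCII character (code point > 127) immediately followed by ';'
NON_ASCII_THEN_SEMICOLON = re.compile(r"([^\x00-\x7f]);")


def escape_semicolon_after_non_ascii(text, escaped_semicolon=u"&#59;"):
    """Replace only those semicolons that follow a non-ASCII character."""
    # Keep the matched character, swap the semicolon for its entity
    # (assumption: "&#59;" is what the downstream code accepts).
    return NON_ASCII_THEN_SEMICOLON.sub(
        lambda m: m.group(1) + escaped_semicolon, text
    )


fixed = escape_semicolon_after_non_ascii(u"to\u2026; write a book;")
ascii_out = fixed.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

Unlike the replace-after-encoding approach, this leaves a literal ";;" in the source alone, because neither semicolon follows a non-ASCII character.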
Upvotes: 1