How to replace, with regex, characters generated by encoding errors that are embedded in text

Question

I need to replace the following character sequences using regex (gsub):

ÃÆÃÂ¨ -> è ÃÆÃÂ -> à ÃÆÃÂ² -> ò ÃÆÃÂ¬ -> ì ÃÆÃÃ¹ -> ù

My strategy is to first remove the first three characters ÃÆÃ, which are common to all of them, and then move on to the trailing ones, leaving à for last since its sequence is basically the lowest common denominator of the others. gsub correctly removes the first three characters, but then it seems it doesn't see the final ones - like Â¨ - although I noticed it does see Ã± (for ñ).
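An alternative to stripping the shared prefix and then the tail is to replace each whole sequence in a single pass, trying the longest sequences first so that the short à sequence cannot match inside the longer ones. A minimal sketch in Python (the same alternation pattern should carry over to any gsub-style function); `build_fixer`, `PREFIX` and `MAPPING` are hypothetical names, and the keys are typed only from the characters visible in the question, so the real data may need different code points:

```python
import re

def build_fixer(mapping):
    """Return a function that replaces every key of `mapping` with its value."""
    # Longest keys first: regex alternation tries alternatives in order,
    # so a short key must not shadow a longer key that starts the same way.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

# Assumption: only the visible characters of the garbled sequences are listed.
# If the data contains invisible characters between them, add them to the keys.
PREFIX = "\u00c3\u00c6\u00c3\u00c2"      # the four visible characters shared by most sequences
MAPPING = {
    PREFIX: "\u00e0",                    # -> à (the shortest sequence)
    PREFIX + "\u00a8": "\u00e8",         # -> è
    PREFIX + "\u00b2": "\u00f2",         # -> ò
    PREFIX + "\u00ac": "\u00ec",         # -> ì
}

fix = build_fixer(MAPPING)
print(fix("perch" + PREFIX + "\u00a8 la citt" + PREFIX))
```

Because each match is replaced as a unit, text that already contains a correct è or à is left alone.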

When I copy and paste the characters into a text editor, I notice they cause weird behaviour (such as moving the cursor forward by a few positions).
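The cursor jumps suggest that the sequences may contain invisible characters (control characters or a no-break space, for example), which would also explain why a pattern typed from what is visible on screen fails to match. A minimal sketch for checking this in Python; `describe` and the sample string are hypothetical, so a fragment copied from the real data should be used instead:

```python
import unicodedata

def describe(text):
    """Return one (code point, Unicode name) pair per character in `text`."""
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.name(ch, "<no name, probably a control character>"))
        for ch in text
    ]

# Hypothetical sample: two visible characters with an invisible one in between.
sample = "\u00c3\u0082\u00a8"
for code_point, name in describe(sample):
    print(code_point, name)
```

Any character reported without a name, or as NO-BREAK SPACE, has to be written into the regex as an escape such as `\u00a0` rather than typed or pasted.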

My dataset was downloaded from a website that itself has encoding problems on its oldest pages but not on the most recent ones (I think they fixed the encoding problem at some point in the last few years). If you visit the oldest pages, you can still see the very same ÃÆÃÂ¨ in plain sight. So the problem is not (I assume) in the encoding of my file.

That is, the encoding errors are limited to regions of the dataset and are not the result of an encoding issue with the whole text corpus.
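Since only some regions are damaged, one option is to repair just the suspicious runs and leave correctly encoded text untouched. The sketch below assumes the damage came from UTF-8 bytes being read as Latin-1 one or more times; that is a guess, not something the question confirms, and a site that used Windows-1252 instead would need a different codec and character range. `SUSPECT`, `undo_once` and `repair` are hypothetical names:

```python
import re

# Runs of two or more characters from the Latin-1 upper half (U+0080..U+00FF).
SUSPECT = re.compile(r"[\u0080-\u00ff]{2,}")

def undo_once(run):
    """Reverse one round of 'UTF-8 bytes read as Latin-1'; keep the run if that fails."""
    try:
        return run.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return run  # not valid UTF-8 underneath, so probably legitimate text

def repair(text, max_rounds=4):
    """Apply undo_once to every suspicious run until nothing changes."""
    for _ in range(max_rounds):
        fixed = SUSPECT.sub(lambda m: undo_once(m.group(0)), text)
        if fixed == text:
            break
        text = fixed
    return text

# Build a doubly garbled sample under the stated assumption, next to a clean word.
garbled = "\u00e8".encode("utf-8").decode("latin-1").encode("utf-8").decode("latin-1")
print(repair("perch" + garbled + " and caff\u00e8"))
```

A lone accented letter is never matched, and a run that does not decode as UTF-8 is returned unchanged, so already-correct regions of the corpus should pass through as they are.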

