CptNemo

Reputation: 6755

How to replace, with regex, characters generated by encoding errors embedded in text

I need to replace the following characters with regex (gsub):

ÃÆè -> è ÃÆà-> à ÃÆò -> ò ÃÆì -> ì ÃÆÃù -> ù

My strategy is to first remove the first three characters ÃÆà, which are common to all of them, and then move on to the last ones, leaving à at the end since it is basically the lowest common denominator. gsub correctly removes the first three, but then it seems it doesn't see the final ones (like ¨), although I noticed it does see ñ (for ñ).

When copy/pasting the characters into a text editor, I noticed they cause weird behaviours (such as moving the cursor forward by a few positions).

My dataset was downloaded from a website that itself has encoding problems on its oldest pages but not on the most recent ones (I think they corrected the encoding problem at some point in the last few years). Visiting the oldest pages, you can still see the very same ̮̬ in plain sight. So the problem is not (I assume) in the encoding of my file.

That is, the encoding errors are limited to regions of the dataset and are not the result of an encoding issue with the whole text corpus.

Upvotes: 1

Views: 302

Answers (1)

CptNemo

Reputation: 6755

When the characters are not displayed correctly, the difficulty is understanding exactly how they are parsed by the regex. In my case, as explained, the encoding errors were limited to a few strings in my dataset, so Encoding() was not applicable.

I solved the problem by visualising the problematic characters directly in the R console. In the console they appear as Ã\u0083Æ\u0092Ã\u0082¨, while in RStudio they were displayed as Ã Æ Ã Â¨. What the console showed was what I needed for a correct regex match: gsub("Ã\u0083Æ\u0092Ã\u0082¨"...
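A minimal sketch of this approach, assuming the broken sequence for è is exactly the one the console shows above. The sample strings and the variable names (`bad_e_grave`, `x`, `fixed_x`) are made up for illustration. The whole pattern is written with `\u` escapes so that the invisible control characters survive copy/pasting, and `fixed = TRUE` is used so the sequence is matched literally rather than as a regex.

```r
# Broken sequence for "è" as shown in the console, written only with \u escapes:
# Ã = \u00c3, Æ = \u00c6, ¨ = \u00a8; \u0083, \u0092 and \u0082 are the
# invisible control characters that an editor may hide or mangle.
bad_e_grave <- "\u00c3\u0083\u00c6\u0092\u00c3\u0082\u00a8"

# Inspect the code points of the sequence to see what is really in the string.
code_points <- sprintf("U+%04X", utf8ToInt(bad_e_grave))
print(code_points)

# Hypothetical sample data: one damaged string and one already-correct string.
x <- c(paste0("perch", bad_e_grave), "caff\u00e8")

# Replace the literal sequence with the intended character ("è" = \u00e8).
fixed_x <- gsub(bad_e_grave, "\u00e8", x, fixed = TRUE)
print(fixed_x)
```

Because only the exact damaged sequence is matched, strings that were already correct are left untouched, which fits a dataset where the errors are limited to some regions. The other accented letters would each need their own sequence, read off the console in the same way.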

Upvotes: 1
