laurent

Reputation: 775

Fixing UTF-8 encoded as ISO-8859-1

Say you have a file that contains both correct UTF-8 characters and UTF-8 characters that were once read by a program that thought they were ISO-8859-1 and then written back out as UTF-8. So you have things like "Ã©" instead of "é". How do you fix that?

Upvotes: 1

Views: 232

Answers (1)

laurent

Reputation: 775

I finally came up with a single sed command that did the job for me:

LANG='' sed -re 's/(\xc3)\x83\xc2([\x80-\xbf])/\1\2/g'

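It turns each double-encoded byte sequence C3 83 C2 xx back into C3 xx, which covers characters whose UTF-8 form starts with 0xC3 (U+00C0 to U+00FF). It does not handle Unicode code points U+00A0 to U+00BF, whose UTF-8 form starts with 0xC2 and which therefore show up as C3 82 C2 xx, but it should be pretty easy to adapt for those.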
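A rough sketch of that adaptation, assuming GNU sed (needed for the \xHH escapes) and using a made-up function name, fix_double_utf8, purely for illustration. It adds a second substitution for the 0xC2 case and sets LC_ALL=C rather than LANG='' so that sed works on raw bytes even when LC_ALL or LC_CTYPE is set in the environment. I have only reasoned through it on small inputs, so treat it as a starting point rather than a finished tool:

```shell
# Sketch only: undo one round of UTF-8 -> ISO-8859-1 -> UTF-8 double encoding.
# Assumes GNU sed; fix_double_utf8 is a hypothetical helper name.
fix_double_utf8() {
  # 1st expression: C3 83 C2 xx -> C3 xx  (U+00C0..U+00FF, e.g. "é")
  # 2nd expression: C3 82 C2 xx -> C2 xx  (U+00A0..U+00BF, e.g. "©")
  LC_ALL=C sed -E \
    -e 's/\xc3\x83\xc2([\x80-\xbf])/\xc3\1/g' \
    -e 's/\xc3\x82\xc2([\xa0-\xbf])/\xc2\1/g' "$@"
}
```

It reads standard input or the files given as arguments, for example `fix_double_utf8 broken.txt > fixed.txt` (file names are placeholders). Characters that were already correct UTF-8 pass through untouched, but genuine text that happens to look double-encoded, such as a real "Ã" immediately followed by "©", would be rewritten too, so it is worth diffing the result against the original.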

Upvotes: 1
