Reputation: 954
I have a file which is described under Unix as:
$file xxx.csv
xxx.csv: UTF-8 Unicode text, with very long lines
Viewing it in less/vi renders some special characters (ßÄ°...) unreadable (├╝); Windows does not display them either, and importing the file directly into a db just changes the special characters into other special characters (+ä, +ñ, ...).
I wanted to convert it to a "default readable" encoding with iconv, but when I try, I get:
$iconv -f UTF-8 -t ISO-8859-1 xxx.csv > yyy.csv
iconv: illegal input sequence at position 1234
Using UNICODE as input and UTF-8 as output returns the same message.
I am guessing the file is somehow encoded in another format that I do not know. How can I find out which one, so that I can convert it to something "universally" readable?
Upvotes: 9
Views: 32896
Reputation: 570
Converting from UTF-8 to ISO-8859-1 only works if your UTF-8 text only has characters that can be represented in ISO-8859-1. If this is not the case, you should specify what needs to happen to these characters, either ignoring (//IGNORE) or approximating (//TRANSLIT) them. Try one of these two:
iconv -f UTF-8 -t ISO-8859-1//IGNORE --output=outfile.csv inputfile.csv
iconv -f UTF-8 -t ISO-8859-1//TRANSLIT --output=outfile.csv inputfile.csv
In most cases, I guess approximation is the best solution, mapping e.g. accented characters to their unaccented counterparts, the euro sign to EUR, etc...
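To make the difference between dropping and substituting characters concrete, here is a minimal Python sketch. It is only an analogy, not what iconv does internally: Python has no built-in equivalent of //TRANSLIT, so errors="replace" (which substitutes "?") stands in for approximation here. The helper name to_latin1 and the sample string are made up for illustration.

```python
# Sketch: what happens to characters that ISO-8859-1 cannot represent.
# to_latin1 is a hypothetical helper; the sample text is invented.

def to_latin1(text: str, errors: str) -> bytes:
    """Encode text as ISO-8859-1 using the given error handler."""
    return text.encode("iso-8859-1", errors=errors)

sample = "5 € für Bücher"  # the euro sign is not part of ISO-8859-1

# Roughly comparable to //IGNORE: the euro sign is silently dropped.
ignored = to_latin1(sample, "ignore")

# Not the same as //TRANSLIT: the euro sign becomes "?" instead of "EUR".
replaced = to_latin1(sample, "replace")

print(ignored.decode("iso-8859-1"))
print(replaced.decode("iso-8859-1"))
```

The umlauts survive in both cases because ISO-8859-1 contains them; only the euro sign is affected.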
Upvotes: 16
Reputation: 10083
If you are not sure about the file type you are dealing with, you can find it as follows:
file file_name
The above command reports the file type, including the text encoding it detects. iconv can then be used accordingly. For example, if the file is UTF-16 and you want to convert it to UTF-8, the following can be used:
iconv -f UTF-16 -t UTF-8 file_name >output_file_name
Hope this adds some insight into what you are looking for.
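Since file only guesses from the content, a strict decode can serve as a cross-check. The following is a rough Python sketch, assuming you have the file contents as bytes; the function names first_invalid_utf8 and guess_from_bom are hypothetical, and the BOM check only covers files that actually start with a byte order mark.

```python
import codecs

def first_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad byte
    except UnicodeDecodeError as exc:
        return exc.start
    return None

def guess_from_bom(data: bytes):
    """Guess the encoding from a leading byte order mark, or return None."""
    # UTF-32 must be checked first: its little-endian BOM starts with the UTF-16 one.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 (with BOM)"
    return None

# Invented sample: valid UTF-8 followed by a stray ISO-8859-1 byte for "ä".
sample = "Größe".encode("utf-8") + b"\xe4"
print(first_invalid_utf8(sample))
print(guess_from_bom(sample))
```

For a real file you would read it in binary mode, e.g. open(path, "rb").read(), and pass the result to both functions.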
Upvotes: 2
Reputation: 954
The problem was that Windows could not interpret the file as UTF-8 on its own. It reads it as a single-byte (extended ASCII) encoding, so ä is interpreted as the two characters Ã¤ (bytes 195 164).
While trying to convert it, I found a solution that works for me:
iconv -f UTF-8 -t WINDOWS-1252//TRANSLIT --output=outfile.csv inputfile.csv
Now I can view the special chars correctly in editors.
For SQL Server compatibility, converting UTF-8 to UTF-16 works even better; the file size just grows quite a bit.
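The effect can be reproduced in a few lines of Python. This is only an illustration with an invented sample row, assuming the misreading application uses Windows-1252 (Python codec name cp1252).

```python
# Sketch: how UTF-8 bytes look when misread as Windows-1252,
# and how the size changes between encodings.

original = "ä"
utf8_bytes = original.encode("utf-8")    # two bytes: 195, 164
misread = utf8_bytes.decode("cp1252")    # each byte shown as its own character

print(list(utf8_bytes), misread)

# In Windows-1252 the same character is a single byte.
cp1252_bytes = original.encode("cp1252")
print(list(cp1252_bytes))

# Size comparison on an invented CSV row.
row = "Größe;Preis\n"
size_utf8 = len(row.encode("utf-8"))
size_utf16 = len(row.encode("utf-16-le"))  # without BOM
print(size_utf8, size_utf16)
```

For mostly-ASCII text, UTF-16 needs two bytes per character where UTF-8 needs one, which is why the file roughly doubles in size.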
Upvotes: 5