file on UTF-8 and ISO8859-1

Question

I have a program that tries to mimic the functionality of the (Linux) file command. It parses a .txt file and reports what kind of data the file contains. However, I struggle to identify files encoded as ISO8859-1 (Latin-1), because the ISO8859-1 characters I write end up stored as UTF-8 instead (for instance, æ should be e6 but is stored as c3 a6).

When I create this .txt file and pass it to file:

printf "æøå" > test.txt

file test.txt

it returns simply:

UTF-8 Unicode text, with no line terminators.

od -c -tx1 test.txt returns:

0000000 303 246 303 270 303 245
         c3  a6  c3  b8  c3  a5
0000006

Can anyone explain why this is the case? The characters 'æøå' are all part of ISO8859-1, yet the file is interpreted as UTF-8.

Bodo · Accepted Answer

Your file clearly contains UTF-8 encoded text. For example, c3 a6 is the UTF-8 encoding of æ.
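To see the difference at the byte level, here is a minimal sketch that writes the bytes explicitly instead of relying on the terminal's encoding. It assumes a POSIX shell whose printf accepts octal escapes; the file names utf8.txt and latin1.txt are made up for illustration.

```shell
# Write "æøå" as explicit bytes so the result does not depend on the terminal's encoding.
# UTF-8: each of these characters takes two bytes (c3 a6, c3 b8, c3 a5).
printf '\303\246\303\270\303\245' > utf8.txt
# ISO8859-1: one byte per character (e6, f8, e5).
printf '\346\370\345' > latin1.txt

# Compare the sizes and hex dumps of the two files.
wc -c < utf8.txt      # 6 bytes
wc -c < latin1.txt    # 3 bytes
od -An -tx1 utf8.txt
od -An -tx1 latin1.txt
```

A file written this way gives you a real ISO8859-1 sample to test your program against, whatever the locale is.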

Your system locale is probably set to one that uses UTF-8, in which case the characters you type are passed to printf as UTF-8 bytes. You can check this by running the locale command.

To convert your file from UTF-8 to ISO8859-1 you can use

recode utf8..iso8859-1 test.txt

After this you will get

$ od -c -tx1 test.txt
0000000 346 370 345
         e6  f8  e5
0000003

As noted by R.., you might have to install recode first. You can also use iconv, but it cannot modify a file in place. See also "Best way to convert text files between character sets?" and https://unix.stackexchange.com/q/10241/330217
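Because iconv writes its result to standard output, a common workaround is to convert into a temporary file and then rename it over the original. This is a minimal sketch, assuming iconv is installed; test.txt.tmp is an arbitrary temporary name, and the first line only recreates the UTF-8 file from the question with explicit bytes.

```shell
# Recreate the UTF-8 test file from the question using explicit bytes for "æøå".
printf '\303\246\303\270\303\245' > test.txt

# Convert into a temporary file first; replace the original only if iconv succeeded.
iconv -f UTF-8 -t ISO-8859-1 test.txt > test.txt.tmp && mv test.txt.tmp test.txt

# The file should now hold one byte per character: e6 f8 e5.
od -An -tx1 test.txt
```

Redirecting straight back into test.txt would not work, since the shell truncates the file before iconv gets to read it.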
