Reputation: 399
Currently I have a program that tries to mimic the functionality of the (Linux) file command. It parses a .txt file and reports which encoding the contents are in. However, I struggle to identify files in ISO8859-1 (Latin-1), because the ISO8859-1 characters I write end up stored as UTF-8 byte sequences instead (for instance æ, which is e6 in ISO8859-1, is stored as c3 a6).
When I create this .txt file and pass it to file:
printf "æøå" > test.txt
file test.txt
it returns simply:
UTF-8 Unicode text, with no line terminators.
od -c -tx1 test.txt returns:
0000000 303 246 303 270 303 245
c3 a6 c3 b8 c3 a5
0000006
Can anyone explain why this is the case? The characters 'æøå' are all contained in ISO8859-1, yet the file is interpreted as UTF-8 instead.
Upvotes: 3
Views: 2130
Reputation: 9875
Your file clearly contains UTF-8 encoded text. For example, c3 a6 is the UTF-8 encoding of æ.
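To see the two byte sequences side by side, here is a minimal Python sketch (Python 3.8+ for the separator argument of bytes.hex). The variable names are only illustrative; the characters are written as escapes so the result does not depend on how this snippet itself is saved.

```python
# The three characters from the question: æ, ø, å
text = "\u00e6\u00f8\u00e5"

# Same characters, two different encodings
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

print(utf8_bytes.hex(" "))    # c3 a6 c3 b8 c3 a5
print(latin1_bytes.hex(" "))  # e6 f8 e5
```

The first line matches the od output in the question, which is why file reports UTF-8.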
Your system locale is probably set to one that uses UTF-8, in which case the characters you type are passed to printf as UTF-8 bytes. You can check this by running the locale command.
To convert your file from UTF-8 to ISO8859-1 you can use
recode utf8..iso8859-1 test.txt
After this you will get
$ od -c -tx1 test.txt
0000000 346 370 345
e6 f8 e5
0000003
As noted by R.., you might have to install recode if it is not already installed. You can also use iconv, but this tool cannot do in-place modification. See also Best way to convert text files between character sets? and https://unix.stackexchange.com/q/10241/330217
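If neither tool is available, the same conversion can be sketched in Python. This is only an illustration, not a replacement for recode or iconv: the function name utf8_to_latin1 and the output file name are made up here, and like iconv it writes to a separate file rather than modifying the input in place.

```python
import tempfile
from pathlib import Path


def utf8_to_latin1(src, dst):
    """Re-encode a UTF-8 text file as ISO-8859-1, writing to a separate file.

    Hypothetical helper for illustration only.
    """
    # Strict decoding: raises UnicodeDecodeError if src is not valid UTF-8
    text = Path(src).read_bytes().decode("utf-8")
    # Raises UnicodeEncodeError if a character does not exist in ISO-8859-1
    Path(dst).write_bytes(text.encode("iso-8859-1"))


# Demo in a throwaway directory, using the bytes from the question
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "test.txt"
    dst = Path(tmp) / "test.latin1.txt"
    src.write_bytes(b"\xc3\xa6\xc3\xb8\xc3\xa5")
    utf8_to_latin1(src, dst)
    converted = dst.read_bytes()

print(converted.hex(" "))  # e6 f8 e5
```

Working on raw bytes avoids any newline translation, so only the character encoding changes.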
Upvotes: 4
Reputation: 215487
Bodo's answer is correct, but I think the root of your problem is the ambiguity of the term "character set". You're correct that all those characters are in the set of characters available in ISO-8859-1, but that's not terribly relevant; all it means is that you can faithfully represent them when encoding your text as ISO-8859-1. The ambiguity (some might even say misuse) of the word "set" here is why, in modern usage, the concept is called "coded character set" or preferably "character encoding", to reflect that the important aspect is how abstract characters in the set of available characters map to stored representations.
As sets, ISO-8859-1 is a subset of Unicode and thus a subset of the set of characters representable by UTF-8. But as encodings they don't agree anywhere except the subset that is ASCII. All other characters present in ISO-8859-1 are represented differently in UTF-8 than in ISO-8859-1; if this weren't the case, there would be no way to represent more than 256 characters since in ISO-8859-1 the meanings of all 256 bytes are assigned (to single characters).
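The claim that the two encodings agree only on ASCII can be checked directly. The following Python sketch walks over the 256 characters of ISO-8859-1 and records where both encodings produce the same bytes; the variable names are illustrative.

```python
# Code points whose ISO-8859-1 and UTF-8 encodings are byte-for-byte identical
agree = []
for code_point in range(256):
    ch = chr(code_point)
    if ch.encode("iso-8859-1") == ch.encode("utf-8"):
        agree.append(code_point)

# Only the ASCII range (0..127) is encoded identically
print(agree == list(range(128)))  # True

# Every ISO-8859-1 character above ASCII needs two bytes in UTF-8
two_bytes = all(len(chr(cp).encode("utf-8")) == 2 for cp in range(128, 256))
print(two_bytes)  # True
```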
As noted in Bodo's answer, æ is encoded in UTF-8 as c3 a6, whereas in ISO-8859-1 it's encoded as e6.
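For a program that mimics file, this difference suggests a simple heuristic: try strict UTF-8 decoding first and fall back to ISO-8859-1 only if that fails. The sketch below is an assumption about how one might do it, not how file itself is implemented; the function name guess_encoding and the returned labels are made up for illustration.

```python
def guess_encoding(data: bytes) -> str:
    """Rough guess at a text encoding (hypothetical helper, labels are illustrative)."""
    if not data:
        return "empty"
    # Bytes below 0x80 mean the same thing in ASCII, UTF-8 and ISO-8859-1
    if all(b < 0x80 for b in data):
        return "ASCII"
    try:
        # Strict decoding rejects malformed multi-byte sequences
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # Every byte value is assigned in ISO-8859-1, so this is a fallback guess,
    # not a proof that the data really is ISO-8859-1
    return "ISO-8859-1"


print(guess_encoding(b"\xc3\xa6\xc3\xb8\xc3\xa5"))  # UTF-8
print(guess_encoding(b"\xe6\xf8\xe5"))              # ISO-8859-1
```

The guess is inherently ambiguous in one direction: the bytes c3 a6 are also valid ISO-8859-1 (they would read as "Ã¦"), so a file that decodes as UTF-8 is only very likely, not certainly, UTF-8.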
Upvotes: 2