Reputation:
I'm producing plain-text files. I use UTF-8 rather than ASCII/ANSI encoding, since the year is 2020 and not 1995. Unicode/UTF-8 is very well established now, and it would be madness to assume no UTF-8 support these days.
At the same time, I have a feeling that plain-text files (.txt) are associated with ANSI/ASCII encoding: because the format looks so primitive, people assume the encoding it uses must be primitive too.
However, I wish to use all kinds of Unicode characters, and not just be limited to the basic ANSI/ASCII ones.
Since plain text has no metadata like HTML does, there is (as far as I know) no way to tell the reader that this .txt uses Unicode/UTF-8, and from what I have learned, you cannot detect it reliably but have to make "educated guesses".
I have seen people add .utf8 to the end of text file names before, but this seems kind of ugly, and I strongly question how widespread support for this is...
Should I name the file like this whenever the .txt file is using UTF-8?
test.txt.utf8
Or will it just make the files harder for people to open, with no actual benefit in terms of detecting them as UTF-8?
Upvotes: 1
Views: 2316
Reputation: 37215
You do not elaborate on the use case of the text files you generate, but there is in fact a "way to tell the reader that this .txt uses Unicode/UTF-8": the Byte Order Mark (BOM) at the beginning of the text file. The byte sequence it is stored as tells the reader which Unicode encoding to use to read the file.
From the Unicode FAQ:
Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
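As a rough illustration of how a reader could apply that table, here is a minimal Python sketch. The function name `detect_bom` and the sample string are made up for this example; the BOM constants come from the standard `codecs` module. It assumes the file really does start with a BOM, and the UTF-32 signatures are checked before the UTF-16 ones because `FF FE 00 00` also begins with `FF FE` (so a UTF-16 little-endian file whose first character is NUL would be misread as UTF-32).

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Writing side: Python's "utf-8-sig" codec prepends the EF BB BF signature.
sample = "naïve café ✓".encode("utf-8-sig")

# Reading side: detect the BOM, skip it, and decode the rest.
encoding, skip = detect_bom(sample)
text = sample[skip:].decode(encoding) if encoding else None
```

If no BOM is present, the function returns `(None, 0)` and the reader is back to the "educated guesses" mentioned in the question.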
Upvotes: 0