user12843931

Is a .txt expected to be in UTF-8 encoding these days? Must I end it with .utf8?

I'm producing plain-text files. I do not use ASCII/ANSI but UTF-8 encoding, since the year is 2020 and not 1995. Unicode/UTF-8 is very well established now and it would be madness to assume no UTF-8 support these days.

At the same time, I have a feeling that plain-text files (.txt) are associated with ANSI/ASCII encoding, as in, because it's so primitive-looking it must also be primitive in the encoding it uses.

However, I wish to use all kinds of Unicode characters, and not just be limited to the basic ANSI/ASCII ones.

Since plain text has no metadata like HTML does, there is (as far as I know) no way to tell the reader that this .txt uses Unicode/UTF-8, and from what I have learned, the encoding cannot be detected reliably; a reader has to make "educated guesses".

I have seen people add .utf8 to the end of text files before, but this seems kind of ugly and I strongly question how widespread support for this is...

Should I do this?

test.txt.utf8

Whenever the .txt file is using UTF-8? Or will it just make it even harder for people to open them with no actual benefit in terms of detecting it as UTF-8?

Upvotes: 1

Views: 2316

Answers (1)

devio

Reputation: 37215

You do not elaborate on the use case for the text files you generate, but there actually is a "way to tell the reader that this .txt uses Unicode/UTF-8": the Byte Order Mark (BOM) at the beginning of the text file. The exact bytes it is stored as tell the reader which Unicode encoding to use to read the file.

From the Unicode FAQ:

Bytes           Encoding Form
00 00 FE FF     UTF-32, big-endian
FF FE 00 00     UTF-32, little-endian
FE FF           UTF-16, big-endian
FF FE           UTF-16, little-endian
EF BB BF        UTF-8
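
To illustrate how a reader could act on that table, here is a minimal Python sketch. The names `sniff_bom` and `read_text` are made up for this example and do not come from any library, and falling back to UTF-8 when no BOM is found is an assumption on my part, not something the FAQ prescribes. The UTF-32 signatures are checked first because FF FE 00 00 also begins with the UTF-16 little-endian signature FF FE.

```python
import codecs

# BOM signatures from the table above. The 4-byte UTF-32 marks come
# first so that FF FE 00 00 is not misread as UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def read_text(path, fallback="utf-8"):
    """Decode a file using its BOM, or `fallback` (an assumption) without one."""
    with open(path, "rb") as f:
        raw = f.read()
    encoding, skip = sniff_bom(raw)
    # Drop the BOM bytes so they do not show up as U+FEFF in the text.
    return raw[skip:].decode(encoding or fallback)
```

On the writing side, Python's built-in "utf-8-sig" codec emits the EF BB BF mark when encoding and strips it when decoding, so `open("test.txt", "w", encoding="utf-8-sig")` should be enough to produce such a file without changing the extension.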

Upvotes: 0
