Zac Faragher

Reputation: 1001

Why doesn't Git natively support UTF-16?

Git supports several different encoding schemes: UTF-7, UTF-8, and UTF-32, as well as non-UTF ones.

Given this, why doesn't it support UTF-16?

There are a lot of questions that ask how to get Git to support UTF-16, but I don't think this has been explicitly asked or answered yet.

Upvotes: 14

Views: 5954

Answers (5)

Rusi

Reputation: 1182

Git has recently begun to understand encodings such as UTF-16. See the gitattributes documentation—search for working-tree-encoding.

If you want .txt files to be UTF-16 (little-endian, without a BOM) on a Windows machine, then add this to your .gitattributes file:

*.txt text working-tree-encoding=UTF-16LE eol=CRLF
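For intuition, this is roughly the conversion Git performs at the work-tree boundary once that attribute is set (a Python sketch of the idea only, not Git's actual code, which does this in C via iconv):

# Conceptual sketch of working-tree-encoding=UTF-16LE, not Git's real implementation.
blob_utf8 = "héllo wörld\r\n".encode("utf-8")          # what lands in the repository

# "smudge" direction: repository -> working tree
worktree_bytes = blob_utf8.decode("utf-8").encode("utf-16-le")

# "clean" direction: working tree -> repository
assert worktree_bytes.decode("utf-16-le").encode("utf-8") == blob_utf8

Because the repository side stays UTF-8, diffs and merges keep seeing ordinary text.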

In response to jthill's comments:

There isn't any doubt that UTF-16 is a mess. However, consider:

  • Java uses UTF-16

  • As does Microsoft

    Note the line UTF16… the one used for native Unicode encoding on Windows operating systems

  • JavaScript uses a mess between UCS-2 and UTF-16

Upvotes: 5

torek

Reputation: 490128

I devote a significant chunk of a full chapter of my (currently rather moribund) book (see Chapter 3, which is in better shape than later chapters) to the issue of character encoding, because it is a historical mess. It's worth mentioning here, though, that part of the premise of this question—that Git supports UTF-7 and UTF-32 in some way—is wrong: UTF-7 is a standard that never even came about and should probably never be used at all (so naturally, older Internet Explorer versions do, and this leads to the security issue mentioned on the linked Wikipedia page).

That said, let's first separate character encoding from code pages. (See footnote-ish section below as well.) The fundamental problem here is that computers—well, modern ones anyway—work with a series of 8-bit bytes, with each byte representing an integer in the range [0..255]. Older systems had 6, 7, 8, and even 9-bit bytes, though I think calling anything less than 8 bits a "byte" is misleading. (BBN's "C machines" had 10-bit bytes!) In any case, if one byte represents one character-symbol, this gives us an upper limit of 256 kinds of symbols. In those bad old days of ASCII, that was sufficient, since ASCII had just 128 symbols, 33 of them being non-printing symbols (control codes 0x00 through 0x1f, plus 0x7f representing DEL or a deleted punch on paper tape, writing them in hexadecimal here).

When we needed more than 94 printable symbols plus the space (0x20), we—by we I mean people using computers all over the world, not specifically me—said: Well, look at this, we have 128 unused encodings, 0x80 through 0xff, let's use some of those! So the French used some for ç and é and so on, and punctuation like « and ». The Czechs needed one for Z-with-caron, ž. The Russians needed lots, for Cyrillic. The Greeks needed lots, and so on. The result was that the upper half of the 8-bit space exploded into many incompatible sets, which people called code pages.

Essentially, the computer stores some eight-bit byte value, such as 235 decimal (0xEB hex), and it's up to something else—another computer program, or ultimately a human staring at a screen—to interpret that 235 as, say, a Cyrillic л character, or a Greek λ, or whatever. The code page, if we are using one, tells us what "235" means: what sort of semantics we should impose on this.

The problem here is that there is a limit on how many character codes we can support. If we want to have the Cyrillic L (л) coexist with the Greek L (lambda, λ), we can't use both CP-1251 and CP-1253 at the same time, so we need a better way to encode the symbol. One obvious way is to stop using one-byte values to encode symbols: if we use two-byte values, we can encode 65536 values, 0x0000 through 0xffff inclusive; subtract a few for control codes and there is still room for many alphabets. However, we rapidly blew through even this limit, so we went to Unicode, which has room for 1,114,112 of what it calls code points, each of which represents some sort of symbol with some sort of semantic meaning. Somewhat over 100,000 of these are now in use, including Emoji like 😀 and 😱.

Encoding Unicode into bytes or words

This is where UTF-8, UTF-16, UTF-32, UCS-2, and UCS-4 all come in. These are all schemes for encoding Unicode code points—one of those ~1 million values—into byte-streams. I'm going to skip over the UCS ones entirely and look only at the UTF-8 and UTF-16 encodings, since those are the two that are currently the most interesting. (See also What are Unicode, UTF-8, and UTF-16?)

The UTF-8 encoding is straightforward: any code point whose decimal value is less than 128 is encoded as a byte containing that value. This means that ordinary ASCII text characters remain ordinary ASCII text characters. Code points in 0x0080 (128 decimal) through 0x07ff (2047 decimal) encode into two bytes, both of whose value is in the 128-255 range and hence distinguishable from a one-byte encoded value. Code points in the 0x0800 through 0xffff range encode into three bytes in that same 128-255 range, and the remaining valid values encode into four such bytes. The key here as far as Git itself is concerned is that no encoded value resembles an ASCII NUL (0x00) or slash (0x2f).
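If you want to see those byte ranges in action, here is a small Python sketch using the built-in UTF-8 codec (the sample characters are arbitrary):

# UTF-8 length depends on the code point; multi-byte sequences stay in 0x80-0xFF.
for ch in ["A", "é", "€", "😀"]:                     # U+0041, U+00E9, U+20AC, U+1F600
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
    # Every byte of a multi-byte sequence is >= 0x80, so 0x00 (NUL) and
    # 0x2F ('/') can only ever appear as their one-byte ASCII selves.
    assert len(encoded) == 1 or all(b >= 0x80 for b in encoded)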

What this UTF-8 encoding does is to allow Git to pretend that text strings—and especially file names—are slash-separated name components whose ends are, or can be anyway, marked with ASCII NUL bytes. This is the encoding that Git uses in tree objects, so UTF-8 encoded tree objects just fit, with no fiddling required.

UTF-16 encoding uses two paired bytes per character. This has two problems for Git and pathnames. First, a byte within a pair might accidentally resemble /, and every ASCII-valued character necessarily encodes as a pair of bytes in which one byte is 0x00, which resembles ASCII NUL. So Git would need to know that this particular path name has been encoded in UTF-16 and work on byte-pairs; there's no room in a tree object for this information, so Git would need a new object type. Second, whenever we break a 16-bit value into two separate 8-bit bytes, we do this in some order: either I give you the more significant byte first, then the less significant byte, or I give you the less significant byte first, then the more significant one. This second problem is the reason that UTF-16 has byte order marks. UTF-8 needs no byte order mark and suffices, so why not use that in trees? So Git does.
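To make the NUL problem concrete, here is a hypothetical path name encoded both ways (Python sketch):

path = "dir/file.txt"

print(path.encode("utf-8"))      # b'dir/file.txt' - the slash is the only 0x2F byte,
                                 # and no byte at all is 0x00
print(path.encode("utf-16-le"))  # b'd\x00i\x00r\x00/\x00f\x00i\x00l\x00e\x00.\x00t\x00x\x00t\x00'
                                 # - every ASCII character drags a 0x00 byte along with it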

That's fine for trees, but we also have commits, tags, and blobs

Git does its own interpretation of three of these four kinds of objects:

  1. Commits contain hash IDs.
  2. Trees contain path names, file modes, and hash IDs.
  3. Tags contain hash IDs.

The one that's not listed here is the blob, and for the most part, Git does not do any interpretation of blobs.

To make it easy to understand the commits, trees, and tags, Git constrains all three to be in UTF-8 for the most part. However, Git does allow the log message in a commit, or the tag text in a tag, to go somewhat (mostly) uninterpreted. These come after the header that Git interprets, so even if there is something particularly tricky or ugly at this point, that's pretty safe. (There are some minor risks here since PGP signatures, which appear below the headers, do get interpreted.) For commits in particular, modern Git will include an encoding header line in the interpreted section, and Git can then attempt to decode the commit message body, and re-encode it into whatever encoding is used by whatever program is interpreting the bytes that Git spits out.1
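As a rough illustration of how a consumer of commit objects can honor that header, here is a hedged Python sketch that reads a raw commit (HEAD is just an example revision) and falls back to UTF-8 when no encoding header is present; Git's own C code is considerably more careful than this:

import subprocess

raw = subprocess.check_output(["git", "cat-file", "commit", "HEAD"])

headers, _, body = raw.partition(b"\n\n")   # the header section ends at the first blank line

encoding = "utf-8"                          # Git's documented default
for line in headers.split(b"\n"):
    if line.startswith(b"encoding "):
        encoding = line.split(b" ", 1)[1].decode("ascii")

print(body.decode(encoding, errors="replace"))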

The same rules could work for annotated tag objects. I'm not sure if Git has code to do that for tags (the commit code could mostly be re-used, but tags much more commonly have PGP signatures, and it's probably wiser just to force UTF-8 here). Since trees are internal objects, their encoding is largely invisible anyway—you do not need to be aware of this (except for the issues that I point out in my book).

This leaves blobs, which are the big gorilla.


1This is a recurring theme in the computing world: everything is repeatedly encoded and decoded. Consider how something arrives over Wi-Fi or a cable network connection: it's been encoded into some sort of radio wave or similar, and then some hardware decodes that into a bit-stream, which some other hardware re-encodes into a byte stream. Hardware and/or software strip off headers, interpret the remaining encoding in some way, change the data appropriately, and re-encode the bits and bytes, for another layer of hardware and software to deal with. It's a wonder anything ever gets done.


Blob encoding

Git likes to claim that it's entirely agnostic to the actual data stored in your files, as Git blobs. This is even mostly true. Or, well, half true. Or something. As long as all Git is doing is storing your data, it's completely true! Git just stores bytes. What those bytes mean is up to you.

This story falls apart when you run git diff or git merge, because the diff algorithms, and hence the merge code, are line-oriented. Lines are terminated with newlines. (If you're on a system that uses CRLF instead of newline, well, the second character of a CRLF pair is a newline, so there's no problem here—and Git is OK with an unterminated final line, though this causes some minor bits of heartburn here and there.) If the file is encoded in UTF-16, a lot of bytes tend to appear to be ASCII NULs, so Git just treats it as binary.
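You can watch the heuristic fire with a Python sketch; the 8000-byte window below is an assumption about where Git scans for NUL bytes, and the point is simply that UTF-16 text is full of them:

text = "print('hello')\nprint('world')\n"

utf8_bytes  = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Git's "is this binary?" check essentially looks for a NUL byte near the
# start of the file; the exact window size is an implementation detail.
print(b"\x00" in utf8_bytes[:8000])    # False -> treated as text
print(b"\x00" in utf16_bytes[:8000])   # True  -> treated as binary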

This is fixable: Git could decode the UTF-16 data into UTF-8, feed that data through all of its existing line-oriented algorithms (which would now see newline-terminated lines), and then re-encode the data back to UTF-16. There are a bunch of minor technical issues here; the biggest is deciding that some file is UTF-16, and if so, which endianness (UTF-16-LE, or UTF-16-BE?). If the file has a byte order marker, that takes care of the endian issue, and UTF-16-ness could be coded into .gitattributes just as you can currently declare files binary or text, so it's all solvable. It's just messy, and no one has done this work yet.

Footnote-ish: code pages can be considered a (crappy) form of encoding

I mentioned above that the thing we do with Unicode is to encode a 21-bit code point value in some number of eight-bit bytes (1 to 4 bytes in UTF-8, 2 or 4 bytes in UTF-16—there's an ugly little trick with what UTF-16 calls surrogates, squeezing 21 bits of value into 16-bit containers by occasionally using pairs of 16-bit values). This encoding trick means we can represent all legal 21-bit code point values, though we may need multiple 8-bit bytes to do so.
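To see the surrogate trick at work, here is a small Python sketch that encodes one emoji and rebuilds its code point from the surrogate pair:

ch = "😀"                                  # U+1F600, outside the 16-bit range
u16 = ch.encode("utf-16-le")
print(u16.hex(" "))                        # 3d d8 00 de: the surrogate pair D83D DE00

# Reconstruct the code point from the pair by hand:
hi = int.from_bytes(u16[0:2], "little")    # 0xD83D, the high surrogate
lo = int.from_bytes(u16[2:4], "little")    # 0xDE00, the low surrogate
code_point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
assert code_point == ord(ch) == 0x1F600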

When we use a code page (CP-number), what we're doing is, or at least can be viewed as, mapping 256 values—those that fit into one 8-bit byte—into that 21-bit code point space. We pick out some subset of no more than 256 such code points and say: These are the code points we'll allow. We encode the first one as, say, 0xa0, the second as 0xa1, and so on. We always leave room for at least a few control codes—usually all 32 in the 0x00 through 0x1f range—and usually we leave the entire 7-bit ASCII subset, as Unicode itself does (see List of Unicode characters), which is why we most typically start at 0xa0.

When one writes proper Unicode support libraries, code pages simply become translation tables, using just this form of indexing. The hard part is making accurate tables for all the code pages, of which there are very many.
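Python's codecs behave exactly like such translation tables, which makes the earlier 235 example easy to reproduce:

b = bytes([0xEB])            # the 235 from the example above

print(b.decode("cp1251"))    # 'л' (Cyrillic small el)
print(b.decode("cp1253"))    # 'λ' (Greek small lambda)
print(b.decode("latin-1"))   # 'ë' (Latin small e with diaeresis)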

The nice thing about code pages is that characters are once again one-byte-each. The bad thing is that you choose your symbol set once, when you say: I use this code page. From then on, you are locked into this small subset of Unicode. If you switch to another code page, some or all of your eight-bit byte values represent different symbols.

Upvotes: 23

jthill

Reputation: 60625

The short form is: adding support for wide characters makes everything harder. Everything that deals with any of the 8-bit ISO code pages, or UTF-8, or any of the other MBCSes (multi-byte character sets) can scan/span/copy strings without much effort. Try to add support for strings whose transfer encoding contains embedded NULs, and the complications to even trivial operations start bloating all your code.
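To illustrate the embedded-NUL problem, here is a small Python sketch of the classic "scan to the terminator" idiom: it works for ASCII, the 8859 code pages, and UTF-8, but stops mid-character on UTF-16 (the sample string is arbitrary):

def c_style_length(buf: bytes) -> int:
    """Length up to the first NUL, like strlen() over a C string."""
    end = buf.find(b"\x00")
    return len(buf) if end < 0 else end

name = "Zoë"

print(c_style_length(name.encode("utf-8")))      # 4 - the whole string
print(c_style_length(name.encode("latin-1")))    # 3 - the whole string
print(c_style_length(name.encode("utf-16-le")))  # 1 - stops inside the first character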

I don't know of even a claimed advantage of UTF-16 that isn't more than undone by the downsides that show up once you actually start using it. You can identify a string boundary in any of ASCII, UTF-8, all of the ISO/IEC 8859 sets, all the EBCDICs, plus probably a dozen more, with the same simple code. With only slight restrictions (ASCII-based, with a few lines added to handle the various line-terminator conventions) you get basic tokenization, and transliteration to a common internal code page is basically free.

Add UTF-16 support and you just bought yourself a huge amount of added effort and complexity, but all that work enables nothing -- after saying "oh, but now it can handle UTF-16!", what else is now possible with all that added bloat and effort? Nothing. Everything UTF-16 can do, UTF-8 can do as well and usually much better.

Upvotes: 1

VonC

Reputation: 1330012

The first mention of UTF-8 in the Git codebase dates back to d4a9ce7 (Aug. 2005, v0.99.6), which was about mailbox patches:

Optionally, with the '-u' flag, the output to .info and .msg is transliterated from its original chaset [sic] to utf-8. This is to encourage people to use utf8 in their commit messages for interoperability.

This was signed by Junio C Hamano / 濱野 純 <[email protected]>.

The character encoding was clarified in commit 3a59e59 (July 2017, Git v2.6.0-rc0):

That "git is encoding agnostic" is only really true for blob objects. E.g. the 'non-NUL bytes' requirement of tree and commit objects excludes UTF-16/32, and the special meaning of '/' in the index file as well as space and linefeed in commit objects eliminates EBCDIC and other non-ASCII encoding.

Git expects bytes < 0x80 to be pure ASCII, thus CJK encoding that partly overlap with the ASCII range are problematic as well. E.g. fmt_ident() removes trailing 0x5C from usernames on the assumption that it is ASCII '\'. However, there are over 200 GBK double byte codes that end in 0x5C.

UTF-8 as default encoding on Linux and respective path translations in the Mac and Windows versions have established UTF-8 NFC as de facto standard for path names.

See "git, msysgit, accents, utf-8, the definitive answers" for more on that last patch.

The most recent version of Documentation/i18n.txt includes:

Git is to some extent character encoding agnostic.

  • The contents of the blob objects are uninterpreted sequences of bytes.
    There is no encoding translation at the core level.

  • Path names are encoded in UTF-8 normalization form C.
    This applies to:

    • tree objects,
    • the index file,
    • ref names, as well as path names in
    • command line arguments,
    • environment variables and
    • configuration files (.git/config, gitignore, gitattributes and gitmodules)

You can see an example of UTF-8 path conversion in commit 0217569 (Jan. 2012, Git v2.1.0-rc0), which added Win32 Unicode file name support.

Changes opendir/readdir to use Windows Unicode APIs and convert between UTF-8/UTF-16.

Regarding command-line arguments, cf. commit 3f04614 (Jan. 2011, Git v2.1.0-rc0), which converts command line arguments from UTF-16 to UTF-8 on startup.


Note: before Git 2.21 (Feb. 2019), the code and tests assumed that the system-supplied iconv() would always use a BOM in its output when asked to encode to UTF-16 (or UTF-32), but apparently some implementations output big-endian without a BOM.
A compile-time knob has been added to help such systems (e.g. NonStop) add the BOM to the output and increase portability.

See commit 79444c9 (12 Feb 2019) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit 18f9fb6, 13 Feb 2019)

utf8: handle systems that don't write BOM for UTF-16

When serializing UTF-16 (and UTF-32), there are three possible ways to write the stream. One can write the data with a BOM in either big-endian or little-endian format, or one can write the data without a BOM in big-endian format.

Most systems' iconv implementations choose to write it with a BOM in some endianness, since this is the most foolproof, and it is resistant to misinterpretation on Windows, where UTF-16 and the little-endian serialization are very common. For compatibility with Windows and to avoid accidental misuse there, Git always wants to write UTF-16 with a BOM, and will refuse to read UTF-16 without it.

However, musl's iconv implementation writes UTF-16 without a BOM, relying on the user to interpret it as big-endian. This causes t0028 and the related functionality to fail, since Git won't read the file without a BOM.

So the "compile-time knob" added here is in the Makefile:

# Define ICONV_OMITS_BOM if your iconv implementation does not write a
# byte-order mark (BOM) when writing UTF-16 or UTF-32 and always writes in
# big-endian format.
#
ifdef ICONV_OMITS_BOM
    BASIC_CFLAGS += -DICONV_OMITS_BOM
endif

Since the NonStop OS and its associated NonStop SQL product always use the UTF-16BE encoding for the Unicode (UCS-2) character set, you can use ICONV_OMITS_BOM in that environment.
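The three serializations described in that commit message are easy to reproduce with Python's codecs, which behave here much like a BOM-writing iconv (a sketch for illustration):

s = "A"

print(s.encode("utf-16-le").hex(" "))  # 41 00       - no BOM, little-endian
print(s.encode("utf-16-be").hex(" "))  # 00 41       - no BOM, big-endian
print(s.encode("utf-16").hex(" "))     # ff fe 41 00 - BOM, then native byte order
                                       #               (little-endian on most machines)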

Upvotes: 8

VonC

Reputation: 1330012

Git support for UTF-16 is coming... for environment variables, with Git 2.20 (Q4 2018)
(and a bug fix in Git 2.21: see the second part of the answer)

See commit fe21c6b, commit 665177e (30 Oct 2018) by Johannes Schindelin (dscho).
Helped-by: Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit 0474cd1, 13 Nov 2018)

mingw: reencode environment variables on the fly (UTF-16 <-> UTF-8)

On Windows, the authoritative environment is encoded in UTF-16.
In Git for Windows, we convert that to UTF-8 (because UTF-16 is such a foreign idea to Git that its source code is unprepared for it).

Previously, out of performance concerns, we converted the entire environment to UTF-8 in one fell swoop at the beginning, and upon putenv() and run_command() converted it back.

Having a private copy of the environment comes with its own perils: when a library used by Git's source code tries to modify the environment, it does not really work (in Git for Windows' case, libcurl, see git-for-windows/git/compare/bcad1e6d58^...bcad1e6d58^2 for a glimpse of the issues).

Hence, it makes our environment handling substantially more robust if we switch to on-the-fly-conversion in getenv()/putenv() calls.
Based on an initial version in the MSVC context by Jeff Hostetler, this patch makes it so.

Surprisingly, this has a positive effect on speed: at the time when the current code was written, we tested the performance, and there were so many getenv() calls that it seemed better to convert everything in one go.
In the meantime, though, Git has obviously been cleaned up a bit with regards to getenv() calls so that the Git processes spawned by the test suite use an average of only 40 getenv()/putenv() calls over the process lifetime.

Speaking of the entire test suite: the total time spent in the re-encoding in the current code takes about 32.4 seconds (out of 113 minutes runtime), whereas the code introduced in this patch takes only about 8.2 seconds in total.
Not much, but it proves that we need not be concerned about the performance impact introduced by this patch.


With Git 2.21 (Q1 2019), a bug introduced by the previous patch and affecting the GIT_EXTERNAL_DIFF command has been corrected: the code assumed the string returned from getenv() to be non-volatile, which is not true.

See commit 6776a84 (11 Jan 2019) by Kim Gybels (Jeff-G).
(Merged by Junio C Hamano -- gitster -- in commit 6a015ce, 29 Jan 2019)

The bug was reported in git-for-windows/git issue 2007:
"Unable to Use difftool on More than 8 File"

$ yes n | git -c difftool.prompt=yes difftool fe21c6b285df fe21c6b285df~100

Viewing (1/404): '.gitignore'
Launch 'bc3' [Y/n]?
Viewing (2/404): 'Documentation/.gitignore'
[...]
Viewing (8/404): 'Documentation/RelNotes/2.18.1.txt'
Launch 'bc3' [Y/n]?
Viewing (9/404): 'Documentation/RelNotes/2.19.0.txt'
Launch 'bc3' [Y/n]? error: cannot spawn ¦?: No such file or directory
fatal: external diff died, stopping at Documentation/RelNotes/2.19.1.txt

Hence:

diff: ensure correct lifetime of external_diff_cmd

According to getenv(3)'s notes:

The implementation of getenv() is not required to be reentrant.
The string pointed to by the return value of getenv() may be statically allocated, and can be modified by a subsequent call to getenv(), putenv(3), setenv(3), or unsetenv(3).

Since strings returned by getenv() are allowed to change on subsequent calls to getenv(), make sure to duplicate when caching external_diff_cmd from environment.

This problem becomes apparent on Git for Windows since fe21c6b (mingw: reencode environment variables on the fly (UTF-16 <-> UTF-8)), when the getenv() implementation provided in compat/mingw.c was changed to keep a certain amount of alloc'ed strings and freeing them on subsequent calls.


Git 2.24 (Q4 2019) fixes a hack introduced previously.

See commit 2049b8d, commit 97fff61 (30 Sep 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit 772cad0, 09 Oct 2019)

Move git_sort(), a stable sort, into libgit.a

The qsort() function is not guaranteed to be stable, i.e. it does not promise to maintain the order of items it is told to consider equal.
In contrast, the git_sort() function we carry in compat/qsort.c is stable, by virtue of implementing a merge sort algorithm.

In preparation for using a stable sort in Git's rename detection, move the stable sort into libgit.a so that it is compiled in unconditionally, and rename it to git_stable_qsort().

Note: this also makes the hack obsolete that was introduced in fe21c6b (mingw: reencode environment variables on the fly (UTF-16 <-> UTF-8), 2018-10-30, Git v2.20.0-rc0), where we included compat/qsort.c directly in compat/mingw.c to use the stable sort.

Upvotes: 1
